[AMORO-4089] Fix optimizing process stuck permanently after AMS restart #4090

Open
j1wonpark wants to merge 1 commit into apache:master from j1wonpark:fix/AMORO-4089-optimizing-process-stuck-after-ams-restart


@j1wonpark j1wonpark commented Feb 15, 2026

Why are the changes needed?

When AMS restarts (e.g., during a Kubernetes rolling update), in-flight optimizing tasks in SCHEDULED or ACKED state are loaded into taskMap but never placed into taskQueue in loadTaskRuntimes(). Since the connection to the original Optimizer is lost after restart, these tasks can never be completed, causing the table to remain stuck in MAJOR_OPTIMIZING/MINOR_OPTIMIZING status permanently.
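The recovery gap described above can be illustrated with a minimal sketch. This is a hypothetical, heavily simplified model and not the actual `OptimizingQueue` code: the `RecoverySketch` class, the `Task` type, the `resetInFlight` flag, and the assumption that `PLANNED` tasks are the only ones enqueued on load are all introduced here for illustration. Only the names `taskMap`, `taskQueue`, `loadTaskRuntimes`, and the task states come from this PR.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified model of task recovery; not the real Amoro classes.
final class RecoverySketch {
  enum Status { PLANNED, SCHEDULED, ACKED, SUCCESS }

  static final class Task {
    final int id;
    Status status;
    String token; // token of the optimizer that held the task before the restart

    Task(int id, Status status, String token) {
      this.id = id;
      this.status = status;
      this.token = token;
    }
  }

  final Map<Integer, Task> taskMap = new LinkedHashMap<>();
  final Queue<Task> taskQueue = new ArrayDeque<>();

  // Reloads persisted tasks after a restart.
  // resetInFlight = false models the behavior described above: SCHEDULED/ACKED
  // tasks land in taskMap only, so nothing will ever poll them again.
  // resetInFlight = true models the proposed recovery: those tasks are reset
  // to PLANNED and put back on the queue.
  void loadTaskRuntimes(List<Task> persisted, boolean resetInFlight) {
    for (Task task : persisted) {
      taskMap.put(task.id, task);
      if (task.status == Status.PLANNED) {
        taskQueue.offer(task);
      } else if (resetInFlight
          && (task.status == Status.SCHEDULED || task.status == Status.ACKED)) {
        task.status = Status.PLANNED;
        task.token = null; // the old optimizer connection no longer exists
        taskQueue.offer(task);
      }
    }
  }
}
```

In this model, loading one `PLANNED`, one `SCHEDULED`, and one `ACKED` task with `resetInFlight = false` leaves a single entry in `taskQueue` while `taskMap` holds all three, which mirrors the stuck state. With `resetInFlight = true`, all three tasks are queued again.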

Additionally, OptimizerKeeper has a race condition where it scans for orphaned tasks before removing the expired optimizer's token from authOptimizers. This causes the predicate !activeTokens.contains(task.getToken()) to evaluate to false, so orphaned tasks are never detected. After the token is removed, no further keeper events are generated, leaving the tasks permanently stuck.
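The ordering problem can be reproduced in a small model. The sketch below is an assumption-laden simplification, not the real `DefaultOptimizingService.OptimizerKeeper`: the `KeeperOrderSketch` class, the `Task` type, the `inFlight` list, and the method signatures are hypothetical. Only the names `authOptimizers`, `activeTokens`, `collectTasks`, `unregisterOptimizer`, and the predicate `!activeTokens.contains(task.getToken())` are taken from this PR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the keeper's expiry handling.
final class KeeperOrderSketch {
  static final class Task {
    final int id;
    final String token;

    Task(int id, String token) {
      this.id = id;
      this.token = token;
    }

    String getToken() {
      return token;
    }
  }

  final Set<String> authOptimizers = new HashSet<>();
  final List<Task> inFlight = new ArrayList<>();

  // A task counts as orphaned when its token is not among the active tokens.
  List<Task> collectTasks() {
    Set<String> activeTokens = new HashSet<>(authOptimizers);
    List<Task> orphaned = new ArrayList<>();
    for (Task task : inFlight) {
      if (!activeTokens.contains(task.getToken())) {
        orphaned.add(task);
      }
    }
    return orphaned;
  }

  void unregisterOptimizer(String token) {
    authOptimizers.remove(token);
  }

  // Old order: the expired token is still registered during the scan,
  // so the predicate is false for its tasks and none are detected.
  List<Task> scanThenUnregister(String expiredToken) {
    List<Task> orphaned = collectTasks();
    unregisterOptimizer(expiredToken);
    return orphaned;
  }

  // Proposed order: remove the token first, then scan.
  List<Task> unregisterThenScan(String expiredToken) {
    unregisterOptimizer(expiredToken);
    return collectTasks();
  }
}
```

With one registered token and one in-flight task holding it, `scanThenUnregister` returns an empty list in this model, while `unregisterThenScan` returns the task. Because the expiry is handled only once, the first ordering has no later opportunity to find the task.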

Resolves #4089

Brief change log

  • Reset SCHEDULED/ACKED tasks to PLANNED and re-queue them during recovery in OptimizingQueue.loadTaskRuntimes()
  • Fix race condition in DefaultOptimizingService.OptimizerKeeper.run() by moving unregisterOptimizer() before collectTasks(), so expired tokens are removed from authOptimizers before orphaned task detection
  • Update testTouchTimeout, testReloadScheduledTask, and testReloadAckTask to reflect the new recovery behavior

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible
  • Add screenshots for manual tests if appropriate
  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? (yes / no)
    no
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
    not applicable

When AMS restarts, SCHEDULED/ACKED tasks were loaded into taskMap but
never re-queued, causing the optimizing process to hang forever.

This fix addresses two issues:

1. Reset SCHEDULED/ACKED tasks to PLANNED during recovery in
   loadTaskRuntimes(), so they are re-queued for execution.

2. Fix race condition in OptimizerKeeper where expired optimizer tokens
   were still in authOptimizers during task scanning. Move
   unregisterOptimizer() before collectTasks() so orphaned tasks are
   correctly detected.

Signed-off-by: Jiwon Park <jpark92@outlook.kr>
@github-actions github-actions bot added the module:ams-server Ams server module label Feb 15, 2026
@j1wonpark (Contributor, Author) commented:

cc @klion26 @czy006 Could you please review this PR? This fix is closely related to #4043 (table runtime status correction after AMS restart) and addresses the case where SCHEDULED/ACKED tasks become permanently orphaned after a restart.
