Open
Labels
type:bug (Something isn't working)
Description
What happened?
When AMS is restarted (e.g., during a Kubernetes rolling update), in-flight Optimizing Processes for tables never complete. The table remains stuck in MAJOR_OPTIMIZING/MINOR_OPTIMIZING/FULL_OPTIMIZING status permanently, preventing any new optimization from being scheduled.
Expected: After AMS restart, SCHEDULED/ACKED tasks should be automatically reset to PLANNED and re-queued, allowing the Optimizing Process to complete normally.
Actual: SCHEDULED/ACKED tasks are loaded into taskMap but never placed into taskQueue, leaving them permanently orphaned.
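The expected behavior can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not the actual AMS code: the class `RestartRecoverySketch`, the `Task` type, and the `Status` enum are made up for illustration, and only the names `taskMap`, `taskQueue`, and `loadTaskRuntimes` come from this report. It also assumes that tasks already in PLANNED state are queued on load.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified model of task loading after an AMS restart.
// These are NOT the real AMS classes; only the field names mirror the report.
class RestartRecoverySketch {
  enum Status { PLANNED, SCHEDULED, ACKED, SUCCESS }

  static final class Task {
    final String id;
    Status status;

    Task(String id, Status status) {
      this.id = id;
      this.status = status;
    }
  }

  final Map<String, Task> taskMap = new LinkedHashMap<>();
  final Queue<Task> taskQueue = new ArrayDeque<>();

  // Loads persisted tasks. Tasks that were in flight when AMS went down
  // (SCHEDULED or ACKED) are reset to PLANNED and re-queued, so they are
  // not left in taskMap with nothing to pick them up again.
  void loadTaskRuntimes(List<Task> persisted) {
    for (Task task : persisted) {
      taskMap.put(task.id, task);
      if (task.status == Status.SCHEDULED || task.status == Status.ACKED) {
        task.status = Status.PLANNED;
        taskQueue.add(task);
      } else if (task.status == Status.PLANNED) {
        // Assumption: tasks that were already PLANNED are queued as well.
        taskQueue.add(task);
      }
      // Finished tasks (SUCCESS in this sketch) stay in taskMap only.
    }
  }
}
```

With this shape, a task persisted as ACKED ends up both in `taskMap` and in `taskQueue` with status PLANNED after loading, while completed tasks are not re-queued.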
Affects Versions
master
What table formats are you seeing the problem on?
Iceberg, Mixed-Iceberg, Paimon, Mixed-Hive
What engines are you seeing the problem on?
AMS, Optimizer
How to reproduce
- Register a table in AMS and start an Optimizer
- Insert data into the table to trigger self-optimizing
- While the Optimizer is executing tasks (tasks in SCHEDULED or ACKED state), restart the AMS process
- After AMS restarts, the table's Optimizing Process never completes and the table stays in MAJOR_OPTIMIZING status
Relevant log output
No error logs are generated: the tasks silently remain in ACKED/SCHEDULED state without any retry or timeout warning.
Anything else
Timeline during AMS restart:
T0: AMS Pod Terminating (rolling update starts)
- Optimizer is executing Task T1 (ACKED state)
T1: AMS Pod Down
- Optimizer completes T1 → completeTask() fails (AMS unavailable)
T2: New AMS Pod Starting
- Loads old optimizer record from DB (token-A, stale touchTime)
- registerOptimizer(token-A) → added to authOptimizers
- loadTaskRuntimes: T1 loaded as ACKED into taskMap only (not re-queued)
T3: OptimizerKeeper processes expired optimizer
- collectTasks: token-A still in authOptimizers → T1 NOT detected
- unregisterOptimizer(token-A) → token-A removed
T4: Task T1 permanently stuck in ACKED state
- No more keeper events (suspendingQueue is empty)
- allTasksPrepared() → false → table stuck forever
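The ordering problem at T3 can be reproduced in isolation with a small sketch. This is a simplified model under stated assumptions, not the actual OptimizerKeeper code: `KeeperOrderingSketch` and its `Task` type are hypothetical, and `collectTasks` here only models the check described above (a task counts as orphaned when its optimizer token is no longer in `authOptimizers`).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the T3 step; not the real AMS implementation.
class KeeperOrderingSketch {
  static final class Task {
    final String id;
    final String token; // token of the optimizer that holds the task

    Task(String id, String token) {
      this.id = id;
      this.token = token;
    }
  }

  // Returns the ids of tasks whose optimizer token is no longer authorized.
  static List<String> collectTasks(List<Task> tasks, Set<String> authOptimizers) {
    List<String> orphaned = new ArrayList<>();
    for (Task task : tasks) {
      if (!authOptimizers.contains(task.token)) {
        orphaned.add(task.id);
      }
    }
    return orphaned;
  }

  public static void main(String[] args) {
    List<Task> tasks = new ArrayList<>();
    tasks.add(new Task("T1", "token-A"));
    Set<String> authOptimizers = new HashSet<>();
    authOptimizers.add("token-A"); // stale record re-registered at T2

    // Order from the timeline: collect first, then unregister.
    List<String> collectedBefore = collectTasks(tasks, authOptimizers);
    authOptimizers.remove("token-A");
    // A second pass would find T1, but no further keeper event triggers it.
    List<String> collectedAfter = collectTasks(tasks, authOptimizers);

    System.out.println("collected before unregister: " + collectedBefore); // []
    System.out.println("collected after unregister: " + collectedAfter); // [T1]
  }
}
```

In this model, T1 is only detected if the collection runs after token-A has been removed, or if a later keeper event re-checks the task. Neither happens in the timeline above, because suspendingQueue is already empty.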
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct