[Bug]: Optimizing process stuck permanently after AMS restart #4089

@j1wonpark

Description

What happened?

When AMS is restarted (e.g., during a Kubernetes rolling update), any in-flight Optimizing Process never completes. The affected table remains stuck in MAJOR_OPTIMIZING/MINOR_OPTIMIZING/FULL_OPTIMIZING status permanently, which prevents any new optimization from being scheduled for that table.

Expected: After AMS restart, SCHEDULED/ACKED tasks should be automatically reset to PLANNED and re-queued, allowing the Optimizing Process to complete normally.

Actual: SCHEDULED/ACKED tasks are loaded into taskMap but never placed into taskQueue, leaving them permanently orphaned.
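The expected recovery behavior can be sketched as follows. This is a minimal illustration only, not Amoro's actual code: the `Status` enum, `TaskRuntime` class, and the `loadTaskRuntimes`-style method are simplified stand-ins for the real types, using the `taskMap`/`taskQueue` names from this report.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class TaskRecoverySketch {
    enum Status { PLANNED, SCHEDULED, ACKED, SUCCESS }

    static class TaskRuntime {
        final int id;
        Status status;
        TaskRuntime(int id, Status status) { this.id = id; this.status = status; }
    }

    final Map<Integer, TaskRuntime> taskMap = new HashMap<>();
    final Queue<TaskRuntime> taskQueue = new ArrayDeque<>();

    // Expected behavior on restart: tasks persisted as SCHEDULED/ACKED are
    // reset to PLANNED and re-queued, so the Optimizing Process can complete.
    // (Today they land in taskMap only and are never queued.)
    void loadTaskRuntimes(Iterable<TaskRuntime> persisted) {
        for (TaskRuntime task : persisted) {
            taskMap.put(task.id, task);
            if (task.status == Status.SCHEDULED || task.status == Status.ACKED) {
                task.status = Status.PLANNED; // reset the in-flight task
            }
            if (task.status == Status.PLANNED) {
                taskQueue.offer(task);        // re-queue it for scheduling
            }
        }
    }

    public static void main(String[] args) {
        TaskRecoverySketch sketch = new TaskRecoverySketch();
        sketch.loadTaskRuntimes(java.util.List.of(
                new TaskRuntime(1, Status.ACKED),
                new TaskRuntime(2, Status.SUCCESS)));
        // The ACKED task was reset and re-queued; the finished task was not.
        System.out.println(sketch.taskQueue.size());      // 1
        System.out.println(sketch.taskMap.get(1).status); // PLANNED
    }
}
```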

Affects Versions

master

What table formats are you seeing the problem on?

Iceberg, Mixed-Iceberg, Paimon, Mixed-Hive

What engines are you seeing the problem on?

AMS, Optimizer

How to reproduce

  1. Register a table in AMS and start an Optimizer
  2. Insert data into the table to trigger self-optimizing
  3. While the Optimizer is executing tasks (tasks in SCHEDULED or ACKED state), restart the AMS process
  4. After AMS restarts, the table's Optimizing Process never completes and the table stays in MAJOR_OPTIMIZING status

Relevant log output

No error logs are generated — the tasks silently remain in ACKED/SCHEDULED state without any retry or timeout warning.

Anything else

Timeline during AMS restart:

T0: AMS Pod Terminating (rolling update starts)
    - Optimizer is executing Task T1 (ACKED state)

T1: AMS Pod Down
    - Optimizer completes T1 → completeTask() fails (AMS unavailable)

T2: New AMS Pod Starting
    - Loads old optimizer record from DB (token-A, stale touchTime)
    - registerOptimizer(token-A) → added to authOptimizers
    - loadTaskRuntimes: T1 loaded as ACKED into taskMap only (not re-queued)

T3: OptimizerKeeper processes expired optimizer
    - collectTasks: token-A still in authOptimizers → T1 NOT detected
    - unregisterOptimizer(token-A) → token-A removed

T4: Task T1 permanently stuck in ACKED state
    - No more keeper events (suspendingQueue is empty)
    - allTasksPrepared() → false → table stuck forever
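The race in the timeline above can be modeled with a small self-contained sketch. This is not Amoro's real implementation: `collectExpiredTasks` is a hypothetical stand-in for the keeper's `collectTasks` sweep, and the collections mirror the `taskMap`/`authOptimizers` names from this report. The point it demonstrates is that because the restarted AMS re-registers the stale token before the keeper runs, the sweep finds nothing to reclaim, and the loaded task is never queued afterwards.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StuckTaskTimeline {
    enum Status { PLANNED, SCHEDULED, ACKED }

    // Keeper-style sweep (simplified): tasks are only reclaimed when their
    // optimizer token is no longer authenticated. A stale but re-registered
    // token therefore shields its orphaned tasks from the sweep.
    static List<String> collectExpiredTasks(Map<String, Status> taskMap,
                                            Set<String> authOptimizers,
                                            String token) {
        List<String> reclaimed = new ArrayList<>();
        if (!authOptimizers.contains(token)) {
            reclaimed.addAll(taskMap.keySet());
        }
        return reclaimed;
    }

    public static void main(String[] args) {
        Map<String, Status> taskMap = new HashMap<>();
        Set<String> authOptimizers = new HashSet<>();
        taskMap.put("T1", Status.ACKED); // restart loads the task into taskMap only
        authOptimizers.add("token-A");   // stale token re-registered from the DB

        // Keeper sweep: token-A is still authenticated, so T1 is skipped ...
        List<String> reclaimed = collectExpiredTasks(taskMap, authOptimizers, "token-A");
        authOptimizers.remove("token-A"); // ... and only then is token-A unregistered.

        // Nothing was reclaimed, nothing was queued, no later event fires:
        // the task stays ACKED forever.
        System.out.println(reclaimed.isEmpty() && taskMap.get("T1") == Status.ACKED); // true
    }
}
```

Run in the other order (unregister first, then sweep) the same sweep would reclaim T1, which is why the registration/keeper ordering matters here.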

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels: type:bug (Something isn't working)