Fix race condition in job recovery causing external_id lookup failure #439
Open
mvdbeek wants to merge 1 commit into galaxyproject:master
The ManagerMonitor thread was starting before active jobs had their external IDs recovered from disk. This caused status checks to fail with "Failed to obtain external_id for job_id" errors on startup.

Observed on usegalaxy.be with job 655740:

- Monitor thread checked job status at 17:10:56,273
- MainThread recovered external_id (292) at 17:10:56,278
- The 5ms gap caused the status check to fail

Fix: Reorder startup so __recover_jobs() runs before __setup_bind_to_message_queue(), ensuring external IDs are loaded before the monitor thread begins polling.
This is probably just a cosmetic / correctness fix, since the monitor thread would retry the status check later.
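The ordering problem described above can be illustrated with a small, self-contained sketch. It is not the project's code: `SketchManager`, `_recover_jobs`, `_poll_once`, and the `recover_first` flag are hypothetical names, the on-disk store is simulated with a dict, and the error string only mimics the logged message. In the buggy branch, an event is used to force the bad interleaving deterministically rather than relying on a 5ms timing window.

```python
import threading


class SketchManager:
    """Hypothetical stand-in for a job manager with a monitor thread."""

    def __init__(self, persisted_ids, recover_first):
        # Simulated on-disk mapping of job_id -> external_id (an assumption).
        self._persisted = dict(persisted_ids)
        # In-memory mapping that the monitor thread consults.
        self._external_ids = {}
        self.errors = []
        self._first_poll_done = threading.Event()
        self._monitor = threading.Thread(target=self._poll_once, daemon=True)

        if recover_first:
            # Fixed order: load external IDs, then let the monitor poll.
            self._recover_jobs()
            self._monitor.start()
        else:
            # Buggy order: the monitor polls before recovery has run.
            self._monitor.start()
            self._first_poll_done.wait()  # force the race outcome for the demo
            self._recover_jobs()
        self._monitor.join()

    def _recover_jobs(self):
        # Copy the persisted external IDs into memory.
        self._external_ids.update(self._persisted)

    def _poll_once(self):
        # One status-check pass over the active jobs.
        for job_id in self._persisted:
            if job_id not in self._external_ids:
                self.errors.append(
                    f"Failed to obtain external_id for job_id {job_id}"
                )
        self._first_poll_done.set()


fixed = SketchManager({"655740": "292"}, recover_first=True)
racy = SketchManager({"655740": "292"}, recover_first=False)
print(fixed.errors)
print(racy.errors)
```

With recovery first, the poll finds every external ID and records no errors; with the monitor started first, the same poll records a lookup failure for the job. A real monitor loops, so the failed check would succeed on a later pass, which is consistent with the note that the fix is mostly cosmetic.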