Fix race condition in job recovery causing external_id lookup failure#439

Open
mvdbeek wants to merge 1 commit into galaxyproject:master from mvdbeek:fix_lost_active_job_on_startup

Conversation


@mvdbeek mvdbeek commented Feb 26, 2026

The ManagerMonitor thread was starting before active jobs had their external IDs recovered from disk. This caused status checks to fail with "Failed to obtain external_id for job_id" errors on startup.

Observed on usegalaxy.be with job 655740:

  • Monitor thread checked job status at 17:10:56,273
  • MainThread recovered external_id (292) at 17:10:56,278
  • The 5ms gap caused the status check to fail
2026-02-26 17:10:56,273 ERROR [pulsar.managers.stateful][[manager=production]-[action=monitor]] Active job IDs in directory '/data/share/persisted_data/production-active-jobs': ['655740']
2026-02-26 17:10:56,273 ERROR [pulsar.managers.stateful][[manager=production]-[action=monitor]] Failed checking active job status for job_id 655740
Traceback (most recent call last):
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 383, in _monitor_active_jobs
    self._check_active_job_status(active_job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 397, in _check_active_job_status
    self.stateful_manager.get_status(active_job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 165, in get_status
    proxy_status, state_change = self.__proxy_status(job_directory, job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 189, in __proxy_status
    proxy_status = self._proxied_manager.get_status(job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/queued_condor.py", line 68, in get_status
    raise Exception("Failed to obtain external_id for job_id %s, cannot determine status." % job_id)
Exception: Failed to obtain external_id for job_id 655740, cannot determine status.
2026-02-26 17:10:56,274 DEBUG [pulsar.client.amqp_exchange][consume-status-pyamqp://galaxy_vib:********@usegalaxy.be:5671//pulsar/galaxy_vib?ssl=1] Consuming queue '<unbound Queue pulsar_test__status -> <unbound Exchange pulsar(direct)> -> pulsar_test__status>'
2026-02-26 17:10:56,273 DEBUG [pulsar.client.amqp_exchange][consume-kill-pyamqp://galaxy_vib:********@usegalaxy.be:5671//pulsar/galaxy_vib?ssl=1] Consuming queue '<unbound Queue pulsar_test__kill -> <unbound Exchange pulsar(direct)> -> pulsar_test__kill>'
2026-02-26 17:10:56,276 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/benchmarking-preprocessing-jobs': []
2026-02-26 17:10:56,276 ERROR [pulsar.managers.stateful][[manager=test]-[action=monitor]] Active job IDs in directory '/data/share/persisted_data/test-active-jobs': []
2026-02-26 17:10:56,276 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/benchmarking-active-jobs': []
2026-02-26 17:10:56,277 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/production-preprocessing-jobs': []
2026-02-26 17:10:56,277 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/production-active-jobs': ['655740']
2026-02-26 17:10:56,278 ERROR [pulsar.managers.base.external][MainThread] Recovered external ID for job_id [655740]: 292
2026-02-26 17:10:56,278 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/test-preprocessing-jobs': []
2026-02-26 17:10:56,278 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/test-active-jobs': []

Fix: Reorder startup so __recover_jobs() runs before __setup_bind_to_message_queue(), ensuring external IDs are loaded before the monitor thread begins polling.
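The ordering dependency can be modeled with a minimal sketch. All names here are illustrative stand-ins (the real `__recover_jobs()` and the monitor thread's status polling in Pulsar do far more than this); the point is only that the first status check must not run before external IDs are restored:

```python
# Minimal model of the startup-order fix. Job and function names are
# illustrative only; Pulsar's real __recover_jobs() and ManagerMonitor
# thread are more involved than this sketch.
def startup(recover_first):
    external_ids = {}  # job_id -> external_id, normally persisted on disk
    errors = []

    def recover_jobs():
        # stand-in for __recover_jobs(): restore persisted external IDs
        external_ids["655740"] = "292"

    def check_active_job_status():
        # stand-in for the monitor thread's first status poll
        if "655740" not in external_ids:
            errors.append("Failed to obtain external_id for job_id 655740")

    if recover_first:
        recover_jobs()               # fixed order: IDs present before polling
        check_active_job_status()
    else:
        check_active_job_status()    # buggy order: first poll races recovery
        recover_jobs()
    return errors
```

With `recover_first=False` the first poll sees an empty ID map and records the same error as the log above; with `recover_first=True` the poll finds the recovered ID and no error is emitted.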

In practice this is probably only a cosmetic fix, since the monitor thread would retry on a later pass and succeed once recovery had completed; it does remove the spurious startup errors, though.
