Fix race condition in job recovery causing external_id lookup failure#439

Open
mvdbeek wants to merge 1 commit into galaxyproject:master from mvdbeek:fix_lost_active_job_on_startup

Conversation


@mvdbeek mvdbeek commented Feb 26, 2026

The ManagerMonitor thread was starting before active jobs had their external IDs recovered from disk. This caused status checks to fail with "Failed to obtain external_id for job_id" errors on startup.

Observed on usegalaxy.be with job 655740:

  • Monitor thread checked job status at 17:10:56,273
  • MainThread recovered external_id (292) at 17:10:56,278
  • The 5ms gap caused the status check to fail
2026-02-26 17:10:56,273 ERROR [pulsar.managers.stateful][[manager=production]-[action=monitor]] Active job IDs in directory '/data/share/persisted_data/production-active-jobs': ['655740']
2026-02-26 17:10:56,273 ERROR [pulsar.managers.stateful][[manager=production]-[action=monitor]] Failed checking active job status for job_id 655740
Traceback (most recent call last):
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 383, in _monitor_active_jobs
    self._check_active_job_status(active_job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 397, in _check_active_job_status
    self.stateful_manager.get_status(active_job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 165, in get_status
    proxy_status, state_change = self.__proxy_status(job_directory, job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/stateful.py", line 189, in __proxy_status
    proxy_status = self._proxied_manager.get_status(job_id)
  File "/opt/pulsar/venv3/lib64/python3.9/site-packages/pulsar/managers/queued_condor.py", line 68, in get_status
    raise Exception("Failed to obtain external_id for job_id %s, cannot determine status." % job_id)
Exception: Failed to obtain external_id for job_id 655740, cannot determine status.
2026-02-26 17:10:56,274 DEBUG [pulsar.client.amqp_exchange][consume-status-pyamqp://galaxy_vib:********@usegalaxy.be:5671//pulsar/galaxy_vib?ssl=1] Consuming queue '<unbound Queue pulsar_test__status -> <unbound Exchange pulsar(direct)> -> pulsar_test__status>'
2026-02-26 17:10:56,273 DEBUG [pulsar.client.amqp_exchange][consume-kill-pyamqp://galaxy_vib:********@usegalaxy.be:5671//pulsar/galaxy_vib?ssl=1] Consuming queue '<unbound Queue pulsar_test__kill -> <unbound Exchange pulsar(direct)> -> pulsar_test__kill>'
2026-02-26 17:10:56,276 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/benchmarking-preprocessing-jobs': []
2026-02-26 17:10:56,276 ERROR [pulsar.managers.stateful][[manager=test]-[action=monitor]] Active job IDs in directory '/data/share/persisted_data/test-active-jobs': []
2026-02-26 17:10:56,276 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/benchmarking-active-jobs': []
2026-02-26 17:10:56,277 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/production-preprocessing-jobs': []
2026-02-26 17:10:56,277 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/production-active-jobs': ['655740']
2026-02-26 17:10:56,278 ERROR [pulsar.managers.base.external][MainThread] Recovered external ID for job_id [655740]: 292
2026-02-26 17:10:56,278 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/test-preprocessing-jobs': []
2026-02-26 17:10:56,278 ERROR [pulsar.managers.stateful][MainThread] Active job IDs in directory '/data/share/persisted_data/test-active-jobs': []

Fix: Reorder startup so __recover_jobs() runs before __setup_bind_to_message_queue(), ensuring external IDs are loaded before the monitor thread begins polling.
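The ordering dependency can be modeled with a minimal sketch. All names here are illustrative stand-ins (the real `__recover_jobs()` and the monitor thread's status polling in Pulsar do far more than this); the point is only that the first status check must not run before external IDs are restored:

```python
# Minimal model of the startup-order fix. Job and function names are
# illustrative only; Pulsar's real __recover_jobs() and ManagerMonitor
# thread are more involved than this sketch.
def startup(recover_first):
    external_ids = {}  # job_id -> external_id, normally persisted on disk
    errors = []

    def recover_jobs():
        # stand-in for __recover_jobs(): restore persisted external IDs
        external_ids["655740"] = "292"

    def check_active_job_status():
        # stand-in for the monitor thread's first status poll
        if "655740" not in external_ids:
            errors.append("Failed to obtain external_id for job_id 655740")

    if recover_first:
        recover_jobs()               # fixed order: IDs present before polling
        check_active_job_status()
    else:
        check_active_job_status()    # buggy order: first poll races recovery
        recover_jobs()
    return errors
```

With `recover_first=False` the first poll sees an empty ID map and records the same error as the log above; with `recover_first=True` the poll finds the recovered ID and no error is emitted.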

In practice this is probably only a cosmetic fix, since the monitor thread would retry on a later pass and succeed once recovery had completed; it does remove the spurious startup errors, though.
