We are encountering an issue when working with Slurm job arrays via pyslurm.Job.load().
In a job array where:
- SLURM_JOB_ID == SLURM_ARRAY_JOB_ID (i.e. the array parent job ID)
- The parent job has finished
- Some array tasks are still running
Calling pyslurm.Job.load(job_id), where job_id is the array parent ID, results in:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pyslurm/core/job/job.pyx", line 307, in pyslurm.core.job.job.Job.load
  File "pyslurm/core/job/job.pyx", line 300, in pyslurm.core.job.job.Job.load
  File "pyslurm/core/job/step.pyx", line 103, in pyslurm.core.job.step.JobSteps._load_single
KeyError: 222
(where 222 is the job ID in this case)
Environment:
- Slurm version: 24.11.6
- pyslurm version: 24.11.0
Analysis:
Job.load() attempts to load job steps as part of job construction.
Normally:
- If the returned dictionary of steps (from JobSteps._load_data()) is empty, an RPC error is raised (pyslurm/pyslurm/core/job/step.pyx, lines 98 to 101 in 2adb2da):

    data = steps._load_data(job.id, slurm.SHOW_ALL)
    if not data and not slurm.IS_JOB_PENDING(job.ptr):
        msg = f"Failed to load step info for Job {job.id}."
        raise RPCError(msg=msg)

- That RPC error is handled in Job.load() (pyslurm/pyslurm/core/job/job.pyx, lines 297 to 302 in 2adb2da):

    if not slurm.IS_JOB_PENDING(wrap.ptr):
        # Just ignore if the steps couldn't be loaded here.
        try:
            wrap.steps = JobSteps._load_single(wrap)
        except RPCError:
            pass

The problematic case appears to be:
- JobSteps._load_single() is called for the array parent job ID.
- The RPC, and thus JobSteps._load_data(), returns steps for all array elements that still have running steps.
- The parent job itself has no steps (it is already finished).
- Therefore, the returned dictionary is non-empty, but does not contain an entry for the parent job ID.
- JobSteps._load_single() then attempts to index into the dictionary using the parent job ID.
- Since that key does not exist, a KeyError is raised.
- This bypasses the normal RPC error handling path in Job.load().
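
The sequence above can be modelled without a Slurm cluster. This is a minimal sketch, assuming the step data is a dictionary keyed by job ID as described; `RPCError`, `load_single`, `outcome`, and the sample job IDs 223 and 224 are stand-ins written for illustration and are not pyslurm's actual code.

```python
class RPCError(Exception):
    """Stand-in for pyslurm's RPCError (hypothetical, for illustration only)."""


def load_single(job_id, data, pending=False):
    """Simplified model of the lookup described above (not pyslurm's code)."""
    # An empty result for a non-pending job takes the handled error path.
    if not data and not pending:
        raise RPCError(f"Failed to load step info for Job {job_id}.")
    # A non-empty result is indexed directly by the requested job ID.
    return data[job_id]


def outcome(job_id, data):
    """Return the name of the exception raised by load_single, or 'ok'."""
    try:
        load_single(job_id, data)
    except RPCError:
        return "RPCError"
    except KeyError:
        return "KeyError"
    return "ok"


# Finished non-array job: the step dictionary is empty.
finished_single = {}
# Finished array parent 222 whose tasks 223 and 224 still run (made-up IDs).
finished_parent = {223: ["batch"], 224: ["batch"]}

print(outcome(222, finished_single))  # RPCError
print(outcome(222, finished_parent))  # KeyError
```

In this model, only the second case escapes an `except RPCError` handler, which matches the traceback reported above.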
By contrast:
- If a single (non-array) job is finished, JobSteps._load_data() returns an empty dict.
- That empty dict correctly triggers the RPC error path, which is handled.
So the failure only occurs when:
- The array parent is finished, and
- Some child tasks are still running.
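
One possible guard, sketched here under the same simplified model, would be to treat a missing key for the requested job ID the same way as an empty result, so that the existing RPC error handling also covers this case. The names `RPCError`, `load_single_guarded`, and `load_steps_or_none` are hypothetical and chosen for illustration; this is not a patch against pyslurm, and the maintainers may prefer a different approach.

```python
class RPCError(Exception):
    """Stand-in for pyslurm's RPCError (hypothetical, for illustration only)."""


def load_single_guarded(job_id, data, pending=False):
    """Look up the steps for job_id, treating a missing key like an empty result."""
    steps = data.get(job_id)
    if steps is None and not pending:
        # Covers both an empty dictionary and a dictionary that only
        # holds steps of other array tasks.
        raise RPCError(f"Failed to load step info for Job {job_id}.")
    return steps if steps is not None else []


def load_steps_or_none(job_id, data):
    """Model of a caller that ignores step-loading failures."""
    try:
        return load_single_guarded(job_id, data)
    except RPCError:
        return None


# Finished array parent 222 with tasks 223 and 224 still running (made-up IDs).
data = {223: ["batch"], 224: ["batch"]}

print(load_steps_or_none(222, data))  # None
print(load_steps_or_none(223, data))  # ['batch']
```

With this guard, the finished array parent takes the same handled path as a finished non-array job instead of raising a KeyError.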