Summary
Slurm.parse_status returns JobStatus.FAILED when sacct produces no output rows,
and JobManager.wait_for_job treats that as terminal and exits.
For a freshly submitted job that has not yet been registered in Slurm's accounting database,
sacct -j <id> legitimately returns empty,
so hpc wait <id> can exit with a FAILED status before the job has a chance to run.
This is pre-existing behavior surfaced during the array-job aggregation work in #7 / #10
but not introduced by it.
Affected code
src/hpc/scheduler.py, Slurm.parse_status:
    lines = []
    for ln in output.strip().splitlines():
        tokens = ln.split()
        if tokens:
            lines.append(tokens[0].rstrip("+"))
    if not lines:
        return JobStatus.FAILED
JobManager.wait_for_job only retries on SSHError,
not on a successful command that returns empty stdout,
so the FAILED propagates to the caller and hpc wait exits non-zero immediately.
Reproduction
$ hpc submit "sleep 60"
job 12345
$ hpc wait 12345 # within ~1 second of submit, before sacct registers the job
# returns FAILED while the job is queued/running
The window is short on a quiet cluster
but can be much longer on a busy controller or under SlurmDBD lag.
Proposal
Treat empty sacct output as transient (analogous to a transient SSH failure)
and keep polling instead of returning FAILED.
Options:
- Distinguish "empty" from "non-empty unknown" in parse_status,
  e.g., return JobStatus.PENDING (or a new sentinel) for the empty case,
  so wait_for_job continues.
- Move the empty-vs-nonempty decision into JobManager.get_job_status / wait_for_job
  and keep parse_status purely a string → enum mapping.
The first option is the smaller change; the second avoids overloading the enum.
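As a rough illustration of the first option, here is a minimal, self-contained sketch. The JobStatus enum below is a stand-in (the project's real member set may differ), UNKNOWN is a hypothetical sentinel name, and the single-row mapping at the end is a simplification; the real parse_status also aggregates array-job rows, which is not reproduced here. It also assumes that a non-empty but unrecognized state keeps mapping to FAILED.

```python
from enum import Enum
from typing import List


class JobStatus(Enum):
    # Stand-in for the project's enum; the real members may differ.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"  # hypothetical sentinel: no accounting row yet


def first_tokens(output: str) -> List[str]:
    """Collect the first token of each non-blank line, as the current code does."""
    states = []
    for ln in output.strip().splitlines():
        tokens = ln.split()
        if tokens:
            states.append(tokens[0].rstrip("+"))
    return states


def parse_status(output: str) -> JobStatus:
    states = first_tokens(output)
    if not states:
        # Empty sacct output: the job may simply not be registered yet,
        # so report "unknown" instead of a terminal FAILED.
        return JobStatus.UNKNOWN
    try:
        # Simplification: only the first row is considered here.
        return JobStatus(states[0])
    except ValueError:
        # Assumption: non-empty but unrecognized states stay FAILED.
        return JobStatus.FAILED
```

With this shape, wait_for_job would only need to treat the sentinel as non-terminal; whether to reuse PENDING instead is the trade-off described above.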
A bounded retry budget should also be applied
so a job that genuinely never appears in accounting (submission failed silently)
does not loop forever.
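The retry budget could look roughly like the sketch below. This is not the existing wait_for_job: the get_status callable, the max_empty_polls and poll_interval parameters, and the use of None for "sacct returned no rows" are all assumptions made to keep the example self-contained, and statuses are plain strings rather than JobStatus members.

```python
import time
from typing import Callable, Optional

TERMINAL = {"COMPLETED", "FAILED"}


def wait_for_job(
    get_status: Callable[[], Optional[str]],
    *,
    poll_interval: float = 5.0,
    max_empty_polls: int = 12,  # hypothetical budget for empty sacct results
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status, tolerating a bounded number of empty results."""
    empty_polls = 0
    while True:
        status = get_status()  # None stands for "sacct returned no rows"
        if status is None:
            empty_polls += 1
            if empty_polls > max_empty_polls:
                # The job never showed up in accounting; give up.
                return "FAILED"
        else:
            empty_polls = 0  # the job is visible, so reset the budget
            if status in TERMINAL:
                return status
        sleep(poll_interval)
```

Returning FAILED once the budget is exhausted keeps the current exit behavior for jobs that never appear; a distinct error might be preferable so callers can tell the two cases apart.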