wait_for_job exits prematurely on empty sacct output for fresh jobs #13

@ultimatile

Summary

Slurm.parse_status returns JobStatus.FAILED when sacct produces no output rows,
and JobManager.wait_for_job treats that as terminal and exits.
For a freshly submitted job that has not yet been registered in Slurm's accounting database,
sacct -j <id> legitimately returns empty,
so hpc wait <id> can exit with a FAILED status before the job has a chance to run.

This behavior predates the array-job aggregation work in #7 / #10;
that work surfaced it but did not introduce it.

Affected code

src/hpc/scheduler.py, Slurm.parse_status:

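lines = []
for ln in output.strip().splitlines():
    tokens = ln.split()
    if tokens:
        # keep the first column, dropping sacct's trailing "+" truncation marker
        lines.append(tokens[0].rstrip("+"))
if not lines:
    # empty sacct output (e.g. job not yet in accounting) is reported as FAILED
    return JobStatus.FAILED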

JobManager.wait_for_job only retries on SSHError,
not on a successful command that returns empty stdout,
so the FAILED propagates to the caller and hpc wait exits non-zero immediately.

Reproduction

$ hpc submit "sleep 60"
job 12345
$ hpc wait 12345    # within ~1 second of submit, before sacct registers the job
# returns FAILED while the job is queued/running

The window is short on a quiet cluster
but can be much longer on a busy controller or under SlurmDBD lag.

Proposal

Treat empty sacct output as transient (analogous to a transient SSH failure)
and keep polling instead of returning FAILED.
Options:

  • Distinguish "empty" from "non-empty unknown" in parse_status,
    e.g., return JobStatus.PENDING (or a new sentinel) for the empty case,
    so that wait_for_job keeps polling.
  • Move the empty-vs-nonempty decision into JobManager.get_job_status / wait_for_job
    and keep parse_status purely a string → enum mapping.

The first option is the smaller change; the second avoids overloading the enum.
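
A minimal sketch of the first option, for discussion only. The JobStatus enum below is a hypothetical stand-in for the project's own enum, UNKNOWN is a made-up sentinel name, and the final string-to-enum lookup is a simplification (it inspects only the first row and does not reproduce the real mapping or the array-job aggregation):

```python
from enum import Enum


class JobStatus(Enum):
    # Hypothetical stand-in for the project's enum; member names are assumptions.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"  # hypothetical sentinel: sacct printed no rows


def parse_status(output: str) -> JobStatus:
    """Map sacct output to a JobStatus, reporting empty output as UNKNOWN."""
    lines = []
    for ln in output.strip().splitlines():
        tokens = ln.split()
        if tokens:
            lines.append(tokens[0].rstrip("+"))
    if not lines:
        # Empty output is "not visible in accounting yet", not a failure.
        return JobStatus.UNKNOWN
    try:
        # Simplification: only the first row is considered here.
        return JobStatus(lines[0])
    except ValueError:
        # Non-empty but unrecognized state keeps the existing FAILED fallback.
        return JobStatus.FAILED
```

With this shape, wait_for_job would treat UNKNOWN as non-terminal, while a non-empty but unrecognized state still maps to FAILED as before.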

A bounded retry budget should also be applied
so a job that genuinely never appears in accounting (submission failed silently)
does not loop forever.
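
The retry budget could look roughly like the sketch below. This is not the existing JobManager.wait_for_job: the get_status callable, the empty_budget and poll_interval parameters, the injectable sleep, and the TERMINAL set are all assumptions made for illustration, and statuses are plain strings to keep the example self-contained. None stands for "sacct returned no rows".

```python
import time
from typing import Callable, Optional

# Hypothetical terminal states; the project's real set may include more.
TERMINAL = {"COMPLETED", "FAILED"}


def wait_for_job(
    get_status: Callable[[], Optional[str]],
    poll_interval: float = 5.0,
    empty_budget: int = 12,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status, tolerating a bounded run of empty results."""
    empty_seen = 0
    while True:
        status = get_status()
        if status is None:
            # Empty sacct output: treat as transient until the budget is spent.
            empty_seen += 1
            if empty_seen > empty_budget:
                return "FAILED"
        else:
            # The job is visible in accounting; reset the consecutive-empty count.
            empty_seen = 0
            if status in TERMINAL:
                return status
        sleep(poll_interval)
```

Counting consecutive empty results, rather than total ones, is one possible choice; it means a job that has already appeared in accounting gets a fresh budget if sacct briefly returns nothing again.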
