scheduler.parse_status mis-classifies empty output as FAILED #8

@ultimatile

Summary

Slurm.parse_status and PJM.parse_status both fall back to JobStatus.FAILED when the scheduler's status command returns no data row. This conflates two distinct cases:

  1. The scheduler genuinely reports the job as failed.
  2. The scheduler does not yet have a row for the job (e.g., Slurm sacct immediately after submission, before accounting indexing catches up).

Affected code

src/hpc/scheduler.py:

class Slurm(Scheduler):
    def parse_status(self, output: str) -> JobStatus:
        lines = output.strip().splitlines()
        status_str = lines[0].strip().rstrip("+") if lines else ""
        return _STATUS_MAP.get(status_str, JobStatus.FAILED)


class PJM(Scheduler):
    def parse_status(self, output: str) -> JobStatus:
        lines = output.strip().splitlines()
        status_str = lines[1].strip() if len(lines) >= 2 else ""
        return self._STATUS_MAP.get(status_str, JobStatus.FAILED)

For Slurm, an empty sacct response gives status_str = "" → JobStatus.FAILED.
For PJM, a header-only pjstat response gives status_str = "" → JobStatus.FAILED.
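
The fall-through can be reproduced outside the repository with a small stand-alone sketch. The JobStatus members, the two status maps, and the "ST" header line below are illustrative stand-ins, not the actual definitions in src/hpc/scheduler.py; only the parsing logic is copied from the snippets above.

```python
from enum import Enum


class JobStatus(Enum):
    # Assumed members; the real enum in the project may differ.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Hypothetical status maps standing in for the real _STATUS_MAP tables.
SLURM_STATUS_MAP = {
    "PENDING": JobStatus.PENDING,
    "RUNNING": JobStatus.RUNNING,
    "COMPLETED": JobStatus.COMPLETED,
    "FAILED": JobStatus.FAILED,
}
PJM_STATUS_MAP = {
    "QUE": JobStatus.PENDING,
    "RUN": JobStatus.RUNNING,
    "END": JobStatus.COMPLETED,
    "ERR": JobStatus.FAILED,
}


def slurm_parse_status(output: str) -> JobStatus:
    # Same logic as Slurm.parse_status: first line, trailing "+" stripped.
    lines = output.strip().splitlines()
    status_str = lines[0].strip().rstrip("+") if lines else ""
    return SLURM_STATUS_MAP.get(status_str, JobStatus.FAILED)


def pjm_parse_status(output: str) -> JobStatus:
    # Same logic as PJM.parse_status: second line, after the header row.
    lines = output.strip().splitlines()
    status_str = lines[1].strip() if len(lines) >= 2 else ""
    return PJM_STATUS_MAP.get(status_str, JobStatus.FAILED)


# A real failure and a missing row are indistinguishable to the caller.
print(slurm_parse_status("FAILED\n"))  # scheduler reported a failure
print(slurm_parse_status(""))          # no row yet, still FAILED
print(pjm_parse_status("ST\n"))        # header only, still FAILED
```

All three calls return JobStatus.FAILED, which is the conflation described in the summary.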

Impact

hpc job-output --follow <id> (added in #4) inspects the status before deciding whether to use tail -F (active) or fall back to cat (terminal). When the scheduler has not yet indexed the just-submitted job, the misparsed FAILED sends the command to the cat path; the output file does not exist yet, so the user gets a No such file error instead of a streaming view.

hpc wait is also affected: it currently treats FAILED as a terminal state and stops polling, so a wait launched immediately after submission can short-circuit before the job actually starts.

Suggested fix

Distinguish "no parseable row" from "FAILED" in parse_status. Sketch:

  • Add a SchedulerError exception in scheduler.py.
  • Have parse_status raise SchedulerError when its input is structurally insufficient (empty Slurm output, header-only PJM output).
  • In JobManager.get_job_status, convert SchedulerError to SSHError so existing callers (wait_for_job, get_job_output, tail_job_output) inherit the "transient / unknown status" handling they already implement for SSHError.

This keeps the JobStatus enum clean (no new UNKNOWN variant required) and reuses existing retry / fall-through paths in callers.
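
A minimal sketch of this approach, written as free functions so it runs on its own. SSHError, JobStatus, and the status maps here are stand-ins for the project's real definitions, and get_job_status takes the raw output and a parser as arguments purely for illustration; the real JobManager.get_job_status presumably runs the status command itself.

```python
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    # Assumed members; the real enum may differ.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


class SchedulerError(Exception):
    """Proposed: the status output contained no parseable row."""


class SSHError(Exception):
    """Stand-in for the project's existing SSHError."""


# Hypothetical status maps standing in for the real tables.
SLURM_STATUS_MAP = {"RUNNING": JobStatus.RUNNING, "FAILED": JobStatus.FAILED}
PJM_STATUS_MAP = {"RUN": JobStatus.RUNNING, "ERR": JobStatus.FAILED}


def slurm_parse_status(output: str) -> JobStatus:
    lines = output.strip().splitlines()
    if not lines:
        # Empty sacct output: the job is not indexed yet, not failed.
        raise SchedulerError("sacct returned no status row")
    status_str = lines[0].strip().rstrip("+")
    return SLURM_STATUS_MAP.get(status_str, JobStatus.FAILED)


def pjm_parse_status(output: str) -> JobStatus:
    lines = output.strip().splitlines()
    if len(lines) < 2:
        # Header-only pjstat output: no data row for this job.
        raise SchedulerError("pjstat returned no status row")
    return PJM_STATUS_MAP.get(lines[1].strip(), JobStatus.FAILED)


def get_job_status(raw_output: str, parse: Callable[[str], JobStatus]) -> JobStatus:
    # Convert SchedulerError to SSHError so callers that already treat
    # SSHError as "status unknown, try again" need no changes.
    try:
        return parse(raw_output)
    except SchedulerError as exc:
        raise SSHError(f"job status unavailable: {exc}") from exc
```

Unrecognized but non-empty status strings still map to JobStatus.FAILED in this sketch; whether that fallback should also change is a separate question.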

Repro

For a freshly submitted Slurm job before sacct indexing catches up:

job_id=$(hpc submit "sleep 30")
hpc status $job_id   # may print "FAILED" before the job has actually run

Environment

  • hpc 0.4.0
  • Slurm and PJM both affected
