
PJM parse_status uses fixed lines[1] index, may misreport multi-row jobs #12

@ultimatile


Summary

PJM.parse_status reads only lines[1] of the pjstat --choose st <id> output and returns the corresponding JobStatus,
ignoring every other row.
This is the same shape as the Slurm bug fixed in #7 (PR #10):
when a job's status command returns more than one meaningful row,
only the first row contributes to the aggregate,
which can let wait_for_job exit while later rows are still non-terminal.

Affected code

src/hpc/scheduler.py, PJM.parse_status:

def parse_status(self, output: str) -> JobStatus:
    lines = output.strip().splitlines()
    status_str = lines[1].strip() if len(lines) >= 2 else ""
    return self._STATUS_MAP.get(status_str, JobStatus.FAILED)

The same JobManager.wait_for_job polling path used for Slurm consumes this status,
so an array or step job whose first reported task happens to be terminal
will end the wait while the rest are still pending or running.
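The failure mode can be reproduced without a cluster. The sketch below copies the lines[1] logic from PJM.parse_status. The JobStatus members, the PJM state codes in the map (QUE, RUN, EXT, CCL, ERR), and the multi-row sample output are all hypothetical stand-ins: the real _STATUS_MAP is not shown in this issue, and the real pjstat output shape for array / step jobs has not been observed yet.

```python
from enum import Enum


class JobStatus(Enum):
    # Hypothetical stand-in for the project's JobStatus enum.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    TIMEOUT = "timeout"


# Hypothetical PJM state codes; the real _STATUS_MAP is not shown in this issue.
_STATUS_MAP = {
    "QUE": JobStatus.PENDING,
    "RUN": JobStatus.RUNNING,
    "EXT": JobStatus.COMPLETED,
    "CCL": JobStatus.CANCELLED,
    "ERR": JobStatus.FAILED,
}


def parse_status_current(output: str) -> JobStatus:
    # Same logic as PJM.parse_status: only lines[1] is consulted.
    lines = output.strip().splitlines()
    status_str = lines[1].strip() if len(lines) >= 2 else ""
    return _STATUS_MAP.get(status_str, JobStatus.FAILED)


# Hypothetical multi-row output: a header, then one state per task.
sample = "ST\nEXT\nRUN\nQUE\n"

# The first data row (EXT) decides the result; RUN and QUE are ignored.
print(parse_status_current(sample))
```

Under these assumptions the call reports COMPLETED even though two rows are still non-terminal, which is exactly the condition that lets wait_for_job return early.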

Verification needed

The fix likely mirrors the Slurm one (aggregate over all rows with priority RUNNING > PENDING > FAILED > CANCELLED > TIMEOUT > COMPLETED),
but pjstat --choose st <id> output for array / step jobs has not been observed yet.
Before applying the aggregation,
the following needs verification on a PJM cluster:

  • Whether pjstat --choose st <id> for an array job emits one row per task (analogous to sacct -X),
    or a different shape (e.g., one row for the parent + sub-rows for tasks).
  • Whether the column layout that puts the State value on lines[1] (after a header) is stable
    across PJM versions and for array / step jobs.
  • Whether there is a flag analogous to sacct -X that suppresses jobsteps so each row is one allocation.

Once the output shape is confirmed,
porting the Slurm aggregation logic should be straightforward.
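A minimal sketch of what the ported aggregation might look like, assuming the answers to the questions above turn out to be "one header line, then one state value per row". The enum, the PJM state codes, and the name parse_status_aggregated are hypothetical; only the priority order RUNNING > PENDING > FAILED > CANCELLED > TIMEOUT > COMPLETED and the FAILED fallback come from this issue and the current code.

```python
from enum import Enum


class JobStatus(Enum):
    # Hypothetical stand-in for the project's JobStatus enum.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    TIMEOUT = "timeout"


# Hypothetical PJM state codes; the real _STATUS_MAP is not shown in this issue.
_STATUS_MAP = {
    "QUE": JobStatus.PENDING,
    "RUN": JobStatus.RUNNING,
    "EXT": JobStatus.COMPLETED,
    "CCL": JobStatus.CANCELLED,
    "ERR": JobStatus.FAILED,
}

# Aggregation priority taken from the Slurm fix: earlier entries win.
_PRIORITY = [
    JobStatus.RUNNING,
    JobStatus.PENDING,
    JobStatus.FAILED,
    JobStatus.CANCELLED,
    JobStatus.TIMEOUT,
    JobStatus.COMPLETED,
]


def parse_status_aggregated(output: str) -> JobStatus:
    # Assumes one header line followed by one state value per row.
    lines = output.strip().splitlines()
    rows = [line.strip() for line in lines[1:] if line.strip()]
    if not rows:
        # Same fallback as the current code when no data row is present.
        return JobStatus.FAILED
    # Unknown state strings fall back to FAILED, as in the current map lookup.
    statuses = [_STATUS_MAP.get(row, JobStatus.FAILED) for row in rows]
    # Return the highest-priority status seen across all rows.
    return min(statuses, key=_PRIORITY.index)
```

With this ordering, any row that is still running or pending keeps the aggregate non-terminal, so wait_for_job keeps polling until every row has finished. If pjstat turns out to emit a parent row plus sub-rows, or extra columns per row, the row filtering above would need to change accordingly.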

Workaround

Until this is fixed,
PJM users running multi-task jobs should not rely on hpc wait for completion
and should instead poll with a scheduler-native command.
