Summary
PJM.parse_status reads only lines[1] of the pjstat --choose st <id> output and returns the corresponding JobStatus,
ignoring every other row.
This is the same shape as the Slurm bug fixed in #7 (PR #10):
when a job's status command returns more than one meaningful row,
only the first row contributes to the aggregate,
which can let wait_for_job exit while later rows are still non-terminal.
Affected code
src/hpc/scheduler.py, PJM.parse_status:
def parse_status(self, output: str) -> JobStatus:
    lines = output.strip().splitlines()
    status_str = lines[1].strip() if len(lines) >= 2 else ""
    return self._STATUS_MAP.get(status_str, JobStatus.FAILED)
The same JobManager.wait_for_job polling path used for Slurm consumes this status,
so an array or step job whose first reported task happens to be terminal
will end the wait while the rest are still pending or running.
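
To make the failure mode concrete, here is a minimal standalone sketch of the same first-row logic. It runs against a hypothetical multi-row output. The JobStatus enum, the STATUS_MAP entries, and the "ST" header with one state per row are all assumptions for illustration: the real _STATUS_MAP is not shown here, and the actual pjstat output shape for array / step jobs is still unverified.

```python
from enum import Enum


class JobStatus(Enum):
    # Stand-in for the project's JobStatus; only the members needed here.
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Hypothetical mapping; the real PJM._STATUS_MAP is not reproduced in this issue.
STATUS_MAP = {
    "QUE": JobStatus.PENDING,
    "RUN": JobStatus.RUNNING,
    "EXT": JobStatus.COMPLETED,
    "ERR": JobStatus.FAILED,
}


def parse_status_first_row(output: str) -> JobStatus:
    """Mirror of the current logic: only lines[1] is inspected."""
    lines = output.strip().splitlines()
    status_str = lines[1].strip() if len(lines) >= 2 else ""
    return STATUS_MAP.get(status_str, JobStatus.FAILED)


# Assumed shape: a header row, then one state per task.
# The first task has exited while two others are still running or queued.
SAMPLE_OUTPUT = "ST\nEXT\nRUN\nQUE\n"

first_row_status = parse_status_first_row(SAMPLE_OUTPUT)
print(first_row_status)  # JobStatus.COMPLETED, although tasks are still active
```

If the real output does have this shape, a wait loop that treats COMPLETED as terminal would stop polling at this point.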
Verification needed
The fix likely mirrors the Slurm one (aggregate over all rows with priority RUNNING > PENDING > FAILED > CANCELLED > TIMEOUT > COMPLETED),
but pjstat --choose st <id> output for array / step jobs has not been observed yet.
Before applying the aggregation,
the following needs verification on a PJM cluster:
- Whether pjstat --choose st <id> for an array job emits one row per task (analogous to sacct -X), or a different shape (e.g., one row for the parent + sub-rows for tasks).
- Whether the column layout that puts the State value on lines[1] (after a header) is stable across PJM versions and against array / step jobs.
- Whether there is a flag analogous to sacct -X that suppresses jobsteps so each row is one allocation.
Once the output shape is confirmed,
porting the Slurm aggregation logic should be straightforward.
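
As a rough sketch of what the ported aggregation could look like, the following applies the priority order quoted above to every data row. It assumes the simplest possible shape (one header line followed by one state string per row), which is exactly what still needs to be confirmed. The JobStatus enum, the STATUS_MAP entries, and the aggregate_status name are hypothetical stand-ins, not the project's actual code.

```python
from enum import Enum


class JobStatus(Enum):
    # Stand-in for the project's JobStatus enum.
    RUNNING = "RUNNING"
    PENDING = "PENDING"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TIMEOUT = "TIMEOUT"
    COMPLETED = "COMPLETED"


# Priority order taken from the issue text: the first match wins.
PRIORITY = [
    JobStatus.RUNNING,
    JobStatus.PENDING,
    JobStatus.FAILED,
    JobStatus.CANCELLED,
    JobStatus.TIMEOUT,
    JobStatus.COMPLETED,
]

# Hypothetical mapping; the real PJM._STATUS_MAP is not reproduced in this issue.
STATUS_MAP = {
    "RUN": JobStatus.RUNNING,
    "QUE": JobStatus.PENDING,
    "ERR": JobStatus.FAILED,
    "CCL": JobStatus.CANCELLED,
    "EXT": JobStatus.COMPLETED,
}


def aggregate_status(output: str) -> JobStatus:
    """Aggregate all rows after the header instead of reading only lines[1]."""
    lines = output.strip().splitlines()
    # Skip the assumed header row and ignore blank lines.
    rows = [line.strip() for line in lines[1:] if line.strip()]
    if not rows:
        # Same fallback as the current code when no status row is present.
        return JobStatus.FAILED
    # Unknown state strings fall back to FAILED, as in the current lookup.
    statuses = {STATUS_MAP.get(row, JobStatus.FAILED) for row in rows}
    for status in PRIORITY:
        if status in statuses:
            return status
    return JobStatus.FAILED
```

With this ordering, any row that is still running or pending keeps the aggregate non-terminal, so the wait only ends once every row has reached a terminal state.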
Workaround
Until this is fixed,
PJM users running multi-task jobs should not rely on hpc wait for completion
and should instead poll with a scheduler-native command.