babs-status: replace pandas with dataclasses#359

Open
asmacdo wants to merge 2 commits into PennLINC:main from asmacdo:babs-status-no-pd

Conversation

Collaborator

@asmacdo asmacdo commented Apr 6, 2026

Claude-assisted design doc in this PR:

  • design/status-without-pandas.md for full context and planning.

  • no changes to the following (see the design doc for more on scope):

    • CLI (i.e., does not add the new --wait feature from #354, "Add --wait flag to babs status")
    • model (columns) of job_status.csv. TODO: how badly do we want to keep it as is?
    • pandas is still in requirements, as it is used by other components
  • changed:

    • internal interfaces
    • fixes the ghost-job bug where a completed job with results was counted as failed when squeue returned empty, caused by pandas dtype issues with bitwise NOT on object-dtype booleans.
    • no longer allows job id to be optional for squeue (it was always used before anyway, and returning all user-submitted jobs could cause problems)
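The ghost-job bug mentioned above stems from how Python's bitwise NOT behaves on plain bools, which is what `~` falls back to on an object-dtype pandas Series: the result is a truthy integer rather than a logical negation, so a mask like `~series` can end up selecting every row. A minimal pure-Python illustration:

```python
# Bitwise NOT on a bool operates on it as an int, so the result is a
# (truthy) negative integer, never a logical negation.
assert ~True == -2
assert ~False == -1
assert bool(~True) is True and bool(~False) is True
```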

Similar previous effort

Drop DataFrames from the babs status pipeline. Job state is now represented by a SchedulerState enum and a JobStatus dataclass, with is_failed and submitted as derived properties instead of stored values.
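A sketch of what that shape could look like. The class and member names come from the PR description, but the field set and the exact derivation rules for `submitted` and `is_failed` are assumptions, not the PR's actual code:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SchedulerState(Enum):
    PENDING = 'PD'
    RUNNING = 'R'
    COMPLETING = 'CG'
    CONFIGURING = 'CF'

@dataclass
class JobStatus:
    sub_id: str
    job_id: Optional[int] = None
    scheduler_state: Optional[SchedulerState] = None
    has_results: bool = False

    @property
    def submitted(self) -> bool:
        # Derived rather than stored: a job counts as submitted once it
        # has been assigned a scheduler job id.
        return self.job_id is not None

    @property
    def is_failed(self) -> bool:
        # Derived in one place: submitted, no longer visible to the
        # scheduler, and no results branch.
        return (
            self.submitted
            and self.scheduler_state is None
            and not self.has_results
        )
```

With derived properties, a completed job with results can never be misclassified as failed by a stale stored flag.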

TODOs:

  • make sure CI is green
  • make sure works "for me"
  • seek feedback on the plausibility of this approach
  • if we agree to proceed, establish the remaining TODOs and finalize

@asmacdo asmacdo force-pushed the babs-status-no-pd branch from 563f239 to d27680b on April 6, 2026 16:03
sources every run. The CSV persists two bits of state destroyed by later
operations: `submitted` (the scheduler forgets completed jobs) and `has_results`
(branches are deleted after merge). Everything else is a convenience artifact.
Is this the right contract, or should we rethink?
Collaborator

per discussion, the question boils down to whether it constitutes some form of "interface", and thus an "interface data model", which we should not change.

@asmacdo asmacdo force-pushed the babs-status-no-pd branch 2 times, most recently from 4714f37 to 9ee19ab on April 9, 2026 15:15
@codecov-commenter

codecov-commenter commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 88.15166% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.51%. Comparing base (25278c6) to head (acaa875).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
babs/base.py 65.21% 6 Missing and 2 partials ⚠️
babs/update.py 56.25% 5 Missing and 2 partials ⚠️
babs/scheduler.py 71.42% 3 Missing and 3 partials ⚠️
babs/status.py 97.20% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #359      +/-   ##
==========================================
+ Coverage   79.36%   79.51%   +0.15%     
==========================================
  Files          16       17       +1     
  Lines        1846     1953     +107     
  Branches      311      330      +19     
==========================================
+ Hits         1465     1553      +88     
- Misses        267      279      +12     
- Partials      114      121       +7     

☔ View full report in Codecov by Sentry.

@asmacdo asmacdo changed the title [WIP] Replace pandas with dataclasses in babs status babs-status: replace pandas with dataclasses Apr 9, 2026
Drop pandas from the status path. DataFrames become dataclasses +
dicts, csv module replaces pandas I/O, magic strings become enums,
and is_failed is derived in one place (a property) instead of three.

Key changes:
- New babs/status.py: JobStatus dataclass, SchedulerState enum,
  CSV read/write, update_from_branches, update_from_scheduler
- run_squeue returns raw text; parsing moves to status.py
- run_squeue job_id is now required (prevents accidental full-queue
  queries and crashes from non-array jobs)
- Strict validation of squeue array job format (jobid_taskid)
- Uses dataclasses.replace to reduce boilerplate in update functions

Bug fix: completed job with results no longer counted as failed
when squeue is empty (the ghost-job dtype bug).

CSV format is unchanged — no migration needed for existing projects.

See design/status-without-pandas.md for full rationale and data flow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
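The "strict validation of squeue array job format" line above could be sketched as follows; the helper name and error message are hypothetical, and BABS's actual parser may differ:

```python
def parse_array_id(raw: str) -> tuple[int, int]:
    # Strictly validate the SLURM array job id format "jobid_taskid",
    # e.g. "8055585_1", rejecting plain job ids and malformed input.
    job_part, sep, task_part = raw.partition('_')
    if not sep or not job_part.isdigit() or not task_part.isdigit():
        raise ValueError(f'expected SLURM array id "jobid_taskid", got {raw!r}')
    return int(job_part), int(task_part)
```

Rejecting non-array ids up front is what turns the old silent crash on non-array jobs into a clear error.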
@asmacdo asmacdo force-pushed the babs-status-no-pd branch from 9ee19ab to 4f9e308 on April 9, 2026 16:43
test_babs_init_raw_bids[subject] often exceeded the default 300s
timeout in CI. Increase to 450s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@asmacdo asmacdo marked this pull request as ready for review April 9, 2026 18:38
@asmacdo asmacdo mentioned this pull request Apr 9, 2026
@asmacdo
Collaborator Author

asmacdo commented Apr 9, 2026

One additional manual test we did: on a dataset with 2 subjects, we initially ran babs submit --count 1 followed by babs status. After that, we ran babs submit --count 1 --skip-running-jobs, then babs status again. This had been buggy before, but it worked with this PR: both jobs showed up as "running", and later as "finished".

final csv state:

sub_id,submitted,is_failed,state,time_used,time_limit,nodes,cpus,partition,name,job_id,task_id,has_results
sub-02,True,False,,1:59,1-00:00:00,1,4,standard,bid,8055585,1,True
sub-13,True,False,,0:12,1-00:00:00,1,4,standard,bid,8055616,1,True

@tien-tong
Contributor

Issues raised by Codex review

1) High: Removing all inclusion rows can leave stale job_status.csv on disk

BABSUpdate now writes via write_job_status_csv(), but that writer is a no-op for empty dicts.
If an update removes all tracked jobs, the in-memory `statuses` dict becomes empty and the old CSV is never overwritten or truncated, so the removed jobs remain in job_status.csv.

  • wrong status accounting after inclusion shrink-to-empty
  • stale jobs can be re-evaluated as submitted/failed later
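One plausible fix is to write unconditionally, so an empty dict still truncates the file. This is a sketch: the field list is trimmed to three columns from the CSV sample above, and the dataclass here is a stand-in, not BABS's `JobStatus`:

```python
import csv
from dataclasses import dataclass, asdict

# Trimmed field set for illustration; the real CSV has more columns.
FIELDNAMES = ['sub_id', 'submitted', 'has_results']

@dataclass
class JobStatus:
    sub_id: str
    submitted: bool = False
    has_results: bool = False

def write_job_status_csv(statuses: dict, path: str) -> None:
    # Open in 'w' mode and write the header unconditionally, so a
    # shrink-to-empty update truncates the stale CSV on disk instead
    # of silently leaving removed jobs behind.
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        for job in statuses.values():
            writer.writerow(asdict(job))
```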

2) Medium: ses_id truthiness checks can crash or mis-key on NA values

Several places use if ses_id to decide whether key is subject-level or session-level.
With pd.NA, this raises TypeError; with NaN, it is truthy and creates keys like (sub_id, nan) that never match existing (sub_id, 'ses-*').

base.py (Lines 410–414)

ses_id = row.get('ses_id') if 'ses_id' in merged_zip_df.columns else None
key = (row['sub_id'], ses_id) if ses_id else (row['sub_id'],)
if key in statuses:
    statuses[key] = replace(statuses[key], has_results=True)

update.py (Lines 118-127)

ses_id = row.get('ses_id') if 'ses_id' in removed_rows.columns else None
key = (row['sub_id'], ses_id) if ses_id else (row['sub_id'],)
# ...
ses_id = row.get('ses_id') if 'ses_id' in added_rows.columns else None
sub_id = row['sub_id']
key = (sub_id, ses_id) if ses_id else (sub_id,)
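A hypothetical helper that sidesteps the truthiness trap, using `pd.isna()` to treat None, NaN, and pd.NA uniformly (the name `make_key` is an assumption for illustration):

```python
import pandas as pd

def make_key(sub_id, ses_id):
    # `if ses_id:` raises TypeError on pd.NA and is truthy for NaN;
    # pd.isna() classifies None, NaN, and pd.NA uniformly as missing,
    # so the key stays subject-level instead of becoming (sub_id, nan).
    if ses_id is None or pd.isna(ses_id):
        return (sub_id,)
    return (sub_id, ses_id)
```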

3) Medium: New strict scheduler-state parsing can break on valid non-covered Slurm states

SchedulerState.from_slurm_state only handles PD, R, CG, and CF, but SLURM squeue can report many other compact state codes — S (SUSPENDED), ST (STOPPED), CA (CANCELLED), F (FAILED), TO (TIMEOUT), OOM, RQ (REQUEUED), and others. When update_from_scheduler encounters any of these in live squeue output, it raises ValueError, crashing the entire babs status command. The old code tolerated arbitrary state strings without crashing.

status.py (Lines 27-33)

@classmethod
def from_slurm_state(cls, state_str: str) -> 'SchedulerState':
    """Convert a SLURM squeue state string to a SchedulerState."""
    for member in cls:
        if member.value == state_str:
            return member
    raise ValueError(f'Unknown SLURM scheduler state: {state_str!r}')

status.py (Lines 299-304)

if squeue_info is not None:
    updated[key] = replace(
        job,
        scheduler_state=SchedulerState.from_slurm_state(squeue_info['state']),

Backward-compat angle:

  • old path tolerated arbitrary state strings (even if imperfectly classified)
  • new path hard-fails for unrecognized active states
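One tolerant variant would map unrecognized codes to a catch-all member instead of raising. The enum values below are the four the review says are covered; the UNKNOWN sentinel and its value are a sketch, not BABS's code:

```python
from enum import Enum

class SchedulerState(Enum):
    PENDING = 'PD'
    RUNNING = 'R'
    COMPLETING = 'CG'
    CONFIGURING = 'CF'
    UNKNOWN = '??'  # catch-all sentinel (hypothetical)

    @classmethod
    def from_slurm_state(cls, state_str: str) -> 'SchedulerState':
        for member in cls:
            if member.value == state_str:
                return member
        # Tolerate codes not modeled here (S, ST, CA, F, TO, OOM, ...)
        # instead of raising and crashing the whole `babs status` run.
        return cls.UNKNOWN
```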

4) Low: Internal helper API compatibility changed

report_job_status signature changed from DataFrame-based inputs to dict-based status model:

scheduler.py (Lines 311-319)

def report_job_status(statuses, analysis_path):
    """
    Print a report that summarizes the overall status of a BABS project.
    Parameters
    ----------
    statuses : dict[tuple, JobStatus]

CLI behavior is updated, but downstream code importing babs.scheduler.report_job_status directly may break.
