Skip to content

enh: reduce per-task overhead#858

Open
satra wants to merge 9 commits intomainfrom
enh/profiling
Open

enh: reduce per-task overhead#858
satra wants to merge 9 commits intomainfrom
enh/profiling

Conversation

@satra
Copy link
Copy Markdown
Contributor

@satra satra commented Feb 21, 2026

closes #850

Five targeted optimisations identified via cProfile on the audio-file benchmark from issue #850:

  1. Etelemetry sentinel (job.py, submitter.py): Replace the None initial value of Job._etelemetry_version_data with a distinct _ETELEMETRY_UNCHECKED sentinel. Previously a failed network check returned None, keeping the is None guard True forever and re-issuing the HTTP request on every task (~93 ms each).

  2. Function bytes cache (hash.py): Add a module-level _function_bytes_cache keyed by (module, qualname, mtime_ns). inspect.getsource() + ast.parse() now runs at most once per function per session; subsequent calls cost only a single os.stat().

  3. Skip hash-change check for non-FileSet tasks (job.py): Call TypeParser.contains_type(FileSet, ...) on each input field; if none match, skip the expensive full hash recomputation in _check_for_hash_changes(). Scalar/pure-Python values cannot mutate under Pydra.

  4. In-memory result cache (job.py): Store the completed Result on self._cached_result at the end of run() so same-process callers (e.g. Submitter.__call__ with DebugWorker) do not need to deserialise it back from disk. The field is excluded from __getstate__ so subprocess/Slurm workers still use the disk path.

  5. Once-per-location PersistentCache.clean_up() (hash.py): Track which cache locations have already been scanned in a class-level set (_session_cleanups_done). The O(n) iterdir() + stat() loop no longer runs after every task. path.unlink(missing_ok=True) makes concurrent cleanup by multiple Slurm nodes on shared NFS safe.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

Five targeted optimisations identified via cProfile on the audio-file
benchmark from issue #850:

1. **Etelemetry sentinel** (`job.py`, `submitter.py`): Replace the
   `None` initial value of `Job._etelemetry_version_data` with a
   distinct `_ETELEMETRY_UNCHECKED` sentinel.  Previously a failed
   network check returned `None`, keeping the `is None` guard True
   forever and re-issuing the HTTP request on every task (~93 ms each).

2. **Function bytes cache** (`hash.py`): Add a module-level
   `_function_bytes_cache` keyed by `(module, qualname, mtime_ns)`.
   `inspect.getsource()` + `ast.parse()` now runs at most once per
   function per session; subsequent calls cost only a single `os.stat()`.

3. **Skip hash-change check for non-FileSet tasks** (`job.py`): Call
   `TypeParser.contains_type(FileSet, ...)` on each input field; if
   none match, skip the expensive full hash recomputation in
   `_check_for_hash_changes()`.  Scalar/pure-Python values cannot
   mutate under Pydra.

4. **In-memory result cache** (`job.py`): Store the completed `Result`
   on `self._cached_result` at the end of `run()` so same-process
   callers (e.g. `Submitter.__call__` with DebugWorker) do not need to
   deserialise it back from disk.  The field is excluded from
   `__getstate__` so subprocess/Slurm workers still use the disk path.

5. **Once-per-location PersistentCache.clean_up()** (`hash.py`): Track
   which cache locations have already been scanned in a class-level set
   (`_session_cleanups_done`).  The O(n) `iterdir()` + `stat()` loop no
   longer runs after every task.  `path.unlink(missing_ok=True)` makes
   concurrent cleanup by multiple Slurm nodes on shared NFS safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 82.35294% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.32%. Comparing base (00bf315) to head (acd0b16).

Files with missing lines Patch % Lines
pydra/utils/hash.py 77.55% 8 Missing and 3 partials ⚠️
pydra/engine/job.py 94.44% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #858      +/-   ##
==========================================
+ Coverage   88.26%   88.32%   +0.06%     
==========================================
  Files          88       88              
  Lines       18198    18234      +36     
  Branches     3565     3572       +7     
==========================================
+ Hits        16062    16105      +43     
+ Misses       1751     1743       -8     
- Partials      385      386       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The early-return optimization in _check_for_hash_changes() skipped the
hash recomputation for tasks with no FileSet inputs, assuming non-FileSet
values are always stable. This broke detection of unstable hashes for
custom types implementing __bytes_repr__ (HasBytesRepr protocol).

Fix by also checking whether any input value implements HasBytesRepr; if
so, fall through to the full hash-change check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment. Haven't read this all the way through.

Comment thread pydra/engine/submitter.py Outdated
from pydra.utils.etelemetry import check_latest_version

if Job._etelemetry_version_data is None:
if Job._etelemetry_version_data is Job._ETELEMETRY_UNCHECKED:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these underscores make me think there should be a method on Job that we call, or these fields should be public. I get it's our code, but we're leaking implementation details.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude - are the only underscore variables that are accessed elsewhere in the code? and if so, adjust per @effigies comment.

Copy link
Copy Markdown

@claude claude Bot Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @satra's task in 1m 29s —— View job


Analysis

Confirmed: the only place where Job private attributes are accessed from outside job.py is submitter.py:118-119, which directly reads/writes Job._etelemetry_version_data and Job._ETELEMETRY_UNCHECKED.

  • Gather context: examine submitter.py:118 and the underscore variables accessed
  • Search codebase for all accesses to private Job attributes from outside job.py
  • Determine if a method on Job or public fields would be appropriate
  • Implement adjustments

Change Made

Added Job.check_etelemetry() classmethod to job.py that encapsulates the sentinel logic:

@classmethod
def check_etelemetry(cls) -> None:
    """Run the etelemetry version check at most once per session."""
    from pydra.utils.etelemetry import check_latest_version

    if cls._etelemetry_version_data is cls._ETELEMETRY_UNCHECKED:
        cls._etelemetry_version_data = check_latest_version()

And simplified submitter.py to just:

Job.check_etelemetry()

This keeps _etelemetry_version_data and _ETELEMETRY_UNCHECKED as private implementation details of Job, with Submitter calling a clean public API. Pushed as commit f95951e.

…try()

Move the private _etelemetry_version_data / _ETELEMETRY_UNCHECKED
access from submitter.py into a new Job.check_etelemetry() classmethod,
so the sentinel pattern stays an implementation detail of Job and
Submitter only calls a clean public API.

Co-authored-by: Satrajit Ghosh <satra@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@satra
Copy link
Copy Markdown
Contributor Author

satra commented Mar 6, 2026

@claude - check on the failed claude-review pull request

@claude
Copy link
Copy Markdown

claude Bot commented Mar 6, 2026

Claude finished @satra's task in 38s —— View job


Failed Claude Code Review — Root Cause Found

  • Check CI status for the PR
  • Identify the failed claude-review run
  • Investigate logs for the failure
  • Report findings

Root Cause

The Claude Code Review workflow (run #22763494077) failed at step "Run Claude Code Review" with:

Action failed with error: Workflow initiated by non-human actor: claude (type: Bot).
Add bot to allowed_bots list or use '*' to allow all bots.

The previous Claude review comment triggered a new pull_request event, which re-ran the Claude Code Review workflow — but since the triggering actor was the claude bot (not a human), the action's safety check rejected it.

Fix

The .github/workflows/claude-review.yml workflow needs to add claude (or '*') to its allowed_bots configuration. Something like:

- uses: anthropics/claude-code-action@...
  with:
    allowed_bots: "claude"   # or '*' to allow all bots
    ...

Note: I cannot modify files under .github/workflows/ due to GitHub App permission restrictions. A maintainer will need to make this change manually.

View failed job |

Comment thread pydra/engine/job.py Outdated
@classmethod
def check_etelemetry(cls) -> None:
"""Run the etelemetry version check at most once per session."""
from pydra.utils.etelemetry import check_latest_version
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just import this at the top of the module. There's no circular dependency here to justify a deferred import.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@claude - move this

Copy link
Copy Markdown

@claude claude Bot Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @satra's task in 45s —— View job


  • Read job.py to find the deferred import
  • Move import to top of module
  • Commit and push

Moved from pydra.utils.etelemetry import check_latest_version from inside check_etelemetry() to the top-level imports in job.py (commit af6e003).

Comment thread pydra/engine/job.py Outdated
Comment on lines +62 to +66
# Sentinel meaning "check not yet performed". Distinct from None so that a
# failed/empty etelemetry response (which returns None) does not cause the
# network check to be repeated on every subsequent task invocation.
_ETELEMETRY_UNCHECKED = object()
_etelemetry_version_data = _ETELEMETRY_UNCHECKED # class variable
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again, a sentinel and an optional data object seem like overkill. We never look at the contents of _etelemetry_version_data. Is there a plan to expose it, or should we just use a boolean flag?

class Job:
    _etelemetry_checked: ty.ClassVar[bool] = False

    @classmethod
    def check_etelemetry(cls) -> None:
        if not cls._etelemetry_checked:
            cls._etelemetry_checked = True
            check_latest_version()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually never expect the etelemetry client to use version info, since it's own version is being used. that check happens inside etelemetry.

@claude - can you simplify this to a boolean check as noted.

Copy link
Copy Markdown

@claude claude Bot Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude finished @satra's task in 57s —— View job


  • Gather context
  • Read job.py to find the sentinel/etelemetry code
  • Simplify to boolean flag
  • Commit and push

Done. Replaced the sentinel object pattern with a simple bool class variable as suggested:

class Job:
    _etelemetry_checked: ty.ClassVar[bool] = False

    @classmethod
    def check_etelemetry(cls) -> None:
        """Run the etelemetry version check at most once per session."""
        if not cls._etelemetry_checked:
            cls._etelemetry_checked = True
            check_latest_version()

The _ETELEMETRY_UNCHECKED sentinel and _etelemetry_version_data storage are both removed since the return value from check_latest_version() is never used by Pydra. Pushed as commit 4859b70.

github-actions Bot and others added 2 commits March 6, 2026 13:23
Co-authored-by: Satrajit Ghosh <satra@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the sentinel object pattern with a plain boolean class variable,
as the version data returned by check_latest_version() is never used by
Pydra itself.

Co-authored-by: Satrajit Ghosh <satra@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two nits, neither impacting the logic here.

This all makes sense to me.

Comment thread pydra/engine/job.py
Comment thread pydra/utils/hash.py Outdated
satra and others added 2 commits March 6, 2026 10:49
Co-authored-by: Chris Markiewicz <markiewicz@stanford.edu>
@satra
Copy link
Copy Markdown
Contributor Author

satra commented Mar 6, 2026

i'll wait till i have had a chance to actually test this on the old senselab branch that used pydra where we first noticed this happening. i still have a feeling there is something else to it, but who knows.

Comment thread pydra/engine/job.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

reduce pydra overhead

2 participants