Skip to content

Replace os.walk with du(1) for directory stats: fix permission failures and slow scans#21

Draft
Copilot wants to merge 3 commits intomainfrom
copilot/refactor-implementation-for-reporting
Draft

Replace os.walk with du(1) for directory stats: fix permission failures and slow scans#21
Copilot wants to merge 3 commits intomainfrom
copilot/refactor-implementation-for-reporting

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 31, 2026

get_directory_stats used os.walk + per-file lstat(), which silently returns zero counts for directories where the calling user lacks execute permission, and is slow on large trees over network filesystems (NFS, Lustre, GPFS) due to per-file syscall overhead. du handles both: it works with elevated privileges configured at the system level (setuid/sudo wrapper common on HPC clusters), and completes a full tree scan in a single subprocess call.

Changes

ontrack.py

  • New _run_du(path, patterns, all_files) — thin wrapper around du --apparent-size --block-size=1 [--exclude=…]. With all_files=False (default) emits only directories; with all_files=True emits everything (-a). Returns list[tuple[int, str]] (bytes, path). Logs warnings on non-zero exit without raising.
  • Rewrote get_directory_stats — two _run_du calls: first without -a to build a dir_paths set, then with -a to iterate all entries. File entries are those absent from dir_paths. Sizes come directly from du output; group-ownership filtering still uses lstat().st_uid as best-effort (files that can't be stat'd are counted rather than silently dropped, consistent with du having already seen them).
  • show_progress kept in the signature for API compatibility; it's a no-op since du runs atomically.

tests/test_ontrack.py

  • Six new test_run_du_* tests covering basic output, dirs-only mode, size accuracy, --exclude behaviour, empty dir, and nonexistent path.
  • test_get_directory_stats_no_execute_subdir — verifies accessible files are still counted when a sibling subdirectory has its execute bit removed (skipped when uid == 0).

Use du(1) instead of Python os.walk + per-file lstat() to compute file
count and total size. This addresses two limitations:

1. Permission handling: os.walk silently fails when the calling user
   lacks execute permission on a directory. On HPC clusters, du is
   often configured with elevated privileges (setuid, sudo wrapper)
   allowing it to traverse directories inaccessible to Python's
   path-based stat calls.

2. Performance: Python os.walk + per-file stat() is slow on large trees
   and network filesystems (NFS, Lustre, GPFS) where each stat call
   incurs network latency. du is a compiled C program that completes in
   a single subprocess call.

Changes:
- Add import subprocess
- Add _run_du() private helper that invokes du --apparent-size
  --block-size=1 with optional -a flag and --exclude per pattern
- Rewrite get_directory_stats to call _run_du twice: without -a to
  collect directory paths, then with -a for all entries; files are
  entries not in the directory set
- Group-membership filtering preserved via lstat().st_uid (best-effort;
  OSError falls through to include the file in the count)
- show_progress parameter kept for API compatibility (no-op with du)
- 7 new tests: 6 for _run_du, 1 for no-execute-permission scenario

Agent-Logs-Url: https://github.com/FertigLab/ontrack/sessions/0056c2c2-596a-45dd-9d8f-aeb1e1e67e39

Co-authored-by: dimalvovs <1246862+dimalvovs@users.noreply.github.com>
Copilot AI changed the title [WIP] Refactor implementation to improve reporting capabilities Replace os.walk with du(1) for directory stats: fix permission failures and slow scans Mar 31, 2026
Copilot AI requested a review from dimalvovs March 31, 2026 18:21
Copy link
Copy Markdown
Contributor

@dimalvovs dimalvovs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_directory_stats does not return number of files and total size, I think due to the groups parameter when applied to the reporting directory. My understanding was that the groups parameter is used only to find the reporting directories, but once a reporting directory has been found, it should be assessed for number of files and total size without the group constraint.

groups is used only to locate reporting directories, not to filter
files within them. Remove the groups= forwarding from _build_directory_entry
to get_directory_stats so all files inside a reporting directory are
counted unconditionally.

Add test_build_directory_entry_group_does_not_filter_stats to pin this.

Agent-Logs-Url: https://github.com/FertigLab/ontrack/sessions/fc73e57a-939b-465c-afd3-80fe3226b6f6

Co-authored-by: dimalvovs <1246862+dimalvovs@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Mar 31, 2026

get_directory_stats does not return number of files and total size, I think due to the groups parameter when applied to the reporting directory. My understanding was that the groups parameter is...

Fixed in 318b478. _build_directory_entry was forwarding groups to get_directory_stats, which caused files in the reporting directory to be filtered by group ownership. Since the reporting directory was already identified by get_group_subdirectories, all files inside it should be counted unconditionally. The fix removes the groups=groups argument from the get_directory_stats call inside _build_directory_entry; groups is now only used to label the output entry.

Copilot AI requested a review from dimalvovs March 31, 2026 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants