Replace os.walk with du(1) for directory stats: fix permission failures and slow scans#21
Replace os.walk with du(1) for directory stats: fix permission failures and slow scans#21
Conversation
Use du(1) instead of Python os.walk + per-file lstat() to compute file count and total size. This addresses two limitations: 1. Permission handling: os.walk silently fails when the calling user lacks execute permission on a directory. On HPC clusters, du is often configured with elevated privileges (setuid, sudo wrapper) allowing it to traverse directories inaccessible to Python's path-based stat calls. 2. Performance: Python os.walk + per-file stat() is slow on large trees and network filesystems (NFS, Lustre, GPFS) where each stat call incurs network latency. du is a compiled C program that completes in a single subprocess call. Changes: - Add import subprocess - Add _run_du() private helper that invokes du --apparent-size --block-size=1 with optional -a flag and --exclude per pattern - Rewrite get_directory_stats to call _run_du twice: without -a to collect directory paths, then with -a for all entries; files are entries not in the directory set - Group-membership filtering preserved via lstat().st_uid (best-effort; OSError falls through to include the file in the count) - show_progress parameter kept for API compatibility (no-op with du) - 7 new tests: 6 for _run_du, 1 for no-execute-permission scenario Agent-Logs-Url: https://github.com/FertigLab/ontrack/sessions/0056c2c2-596a-45dd-9d8f-aeb1e1e67e39 Co-authored-by: dimalvovs <1246862+dimalvovs@users.noreply.github.com>
dimalvovs
left a comment
There was a problem hiding this comment.
get_directory_stats does not return number of files and total size, I think due to the groups parameter when applied to the reporting directory. My understanding was that the groups parameter is used only to find the reporting directories, but once a reporting directory has been found, it should be assessed for number of files and total size without the group constraint.
groups is used only to locate reporting directories, not to filter files within them. Remove the groups= forwarding from _build_directory_entry to get_directory_stats so all files inside a reporting directory are counted unconditionally. Add test_build_directory_entry_group_does_not_filter_stats to pin this. Agent-Logs-Url: https://github.com/FertigLab/ontrack/sessions/fc73e57a-939b-465c-afd3-80fe3226b6f6 Co-authored-by: dimalvovs <1246862+dimalvovs@users.noreply.github.com>
Fixed in 318b478. |
get_directory_statsusedos.walk+ per-filelstat(), which silently returns zero counts for directories where the calling user lacks execute permission, and is slow on large trees over network filesystems (NFS, Lustre, GPFS) due to per-file syscall overhead.duhandles both: it works with elevated privileges configured at the system level (setuid/sudo wrapper common on HPC clusters), and completes a full tree scan in a single subprocess call.Changes
ontrack.py_run_du(path, patterns, all_files)— thin wrapper arounddu --apparent-size --block-size=1 [--exclude=…]. Withall_files=False(default) emits only directories; withall_files=Trueemits everything (-a). Returnslist[tuple[int, str]](bytes, path). Logs warnings on non-zero exit without raising.get_directory_stats— two_run_ducalls: first without-ato build adir_pathsset, then with-ato iterate all entries. File entries are those absent fromdir_paths. Sizes come directly fromduoutput; group-ownership filtering still useslstat().st_uidas best-effort (files that can't be stat'd are counted rather than silently dropped, consistent withduhaving already seen them).show_progresskept in the signature for API compatibility; it's a no-op sinceduruns atomically.tests/test_ontrack.pytest_run_du_*tests covering basic output, dirs-only mode, size accuracy,--excludebehaviour, empty dir, and nonexistent path.test_get_directory_stats_no_execute_subdir— verifies accessible files are still counted when a sibling subdirectory has its execute bit removed (skipped whenuid == 0).