Skip to content

[Enhancement] Use atomic directory rename for cache writes#1982

Open
LeiWang1999 wants to merge 2 commits intomainfrom
feature/atomic-cache-dir-rename
Open

[Enhancement] Use atomic directory rename for cache writes#1982
LeiWang1999 wants to merge 2 commits intomainfrom
feature/atomic-cache-dir-rename

Conversation

@LeiWang1999
Copy link
Copy Markdown
Member

@LeiWang1999 LeiWang1999 commented Mar 27, 2026

Summary

  • Replace per-file os.replace with write-to-staging + os.rename for both KernelCache and AutotuneResult, ensuring other processes never observe an incomplete cache directory.
  • Add _cleanup_stale_staging_dirs() to remove orphaned staging directories (>1h old) left by crashed processes, called on KernelCache singleton init.
  • Subclass save methods (_save_kernel_source_code_to_disk, _save_wrapper_kernel_code_to_disk, _save_so_cubin_to_disk) require no changes — they receive the staging path as cache_path.

Test plan

  • All existing tvm_ffi cache tests pass (pytest testing/python/cache/test_tilelang_kernel_cache.py -v)
  • No .staging_* directories remain after successful writes
  • Optional: concurrent multi-process write test for same cache key

Summary by CodeRabbit

  • Refactor
    • Improved reliability of save operations by writing to per-run staging directories and atomically publishing them to final locations to avoid partial writes.
    • Added automatic cleanup of stale temporary staging directories to limit buildup.
    • Strengthened error handling during atomic publish (race and other failures) with safe cleanup and logging.

Replace per-file os.replace with write-to-staging + os.rename to
ensure other processes never see an incomplete cache directory.
Also add cleanup of stale staging dirs on init.
@github-actions
Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

Both modules now write kernel/artifact outputs into unique per-call staging directories and atomically publish them by renaming the staging directory to the final path; stale staging dirs are cleaned up and rename races are handled with targeted cleanup and logging. (47 words)

Changes

Cohort / File(s) Summary
Autotuner atomic save
tilelang/autotuner/param.py
save_to_disk() fast-paths when path already exists, writes all result files into a uniquely named staging directory, then atomically renames the staging dir to path. Adds shutil usage, try/except around os.rename, staging cleanup on errors, and docstring updates.
Kernel cache staging + cleanup
tilelang/cache/kernel_cache.py
Replaces direct writes with a staging-then-rename publish flow in _save_kernel_to_disk(). Adds _cleanup_stale_staging_dirs(max_age_seconds=3600) invoked during dir setup to remove old .staging_... folders. Handles OSError-rename races by removing the staging dir and logs other exceptions before cleanup.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Autotuner
    participant KernelCache
    participant Filesystem
    Autotuner->>Filesystem: check if final `path` exists
    alt final path exists
        Filesystem-->>Autotuner: exists -> fast return
    else
        Autotuner->>Filesystem: create unique staging dir `.staging_<uuid>` under cache/root
        Autotuner->>Filesystem: write artifacts (best_config, function, out_idx, latency, kernel files) into staging
        Autotuner->>KernelCache: invoke publish (rename staging -> final path)
        KernelCache->>Filesystem: attempt atomic rename(staging, final)
        alt rename success
            Filesystem-->>KernelCache: renamed
            KernelCache-->>Autotuner: publish complete
        else rename OSError (race)
            Filesystem-->>KernelCache: OSError
            KernelCache->>Filesystem: remove staging dir
            KernelCache-->>Autotuner: treat as published by other process
        else other Exception
            KernelCache->>Filesystem: remove staging dir
            KernelCache-->>Autotuner: log exception and fail
        end
    end
    Note right of KernelCache: periodic cleanup invoked during init:\nremove `.staging_...` older than TTL
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

  • [Bugfix] Fix autotune cache #1315: Modifies autotuner/param.py kernel persistence to use atomic/staging writes and safer save/load flows; directly related to the staging+rename semantics implemented here.

Poem

🐰
I dug a staging burrow neat and round,
Wrote kernels safe without a sound.
A single hop — a rename true,
No races here, stale nests cleared too.
Hop on, build fast, the cache is found! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and directly describes the main technical change: implementing atomic directory rename semantics for cache writes across both KernelCache and AutotuneResult modules.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/atomic-cache-dir-rename

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tilelang/autotuner/param.py (1)

430-435: Error handling pattern is reasonable.

The OSError catch handles the benign race where another process created the final path. The broad Exception catch (flagged by Ruff) is acceptable here since it ensures cleanup and logs the actual exception via logger.exception(). However, consider re-raising after cleanup if callers need to know the save failed:

♻️ Optional: re-raise after logging
         except Exception:
             shutil.rmtree(str(staging_path), ignore_errors=True)
             logger.exception("Error during atomic autotune result save")
+            raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/autotuner/param.py` around lines 430 - 435, The current exception
handling around the atomic autotune result save swallows non-OSError exceptions
after cleaning up staging_path and logging via logger.exception("Error during
atomic autotune result save"); update the except Exception block in the save
routine to perform the same cleanup (shutil.rmtree(str(staging_path),
ignore_errors=True)), call logger.exception with the existing message, and then
re-raise the exception so callers (callers of the atomic save function that uses
staging_path) can detect the failure; reference the staging_path variable and
the logger.exception call when locating the try/except to modify.
tilelang/cache/kernel_cache.py (1)

384-413: Partial cache entries can be committed if individual saves fail.

The inner try-except blocks log errors but allow execution to continue to the os.rename. If, for example, _save_kernel_source_code_to_disk fails but other saves succeed, an incomplete staging directory is still atomically renamed to the final path.

While _load_kernel_from_disk and _build_kernel handle missing components gracefully (returning None), this means:

  1. Incomplete cache entries occupy disk space permanently
  2. Every subsequent load attempt must re-check and fail

Consider tracking save success and only renaming if all critical files were written:

♻️ Optional: gate rename on successful saves
+        save_ok = True
         try:
             # Save kernel source code
             try:
                 self._save_kernel_source_code_to_disk(kernel, staging_path, verbose)
             except Exception:
                 self.logger.exception("Error saving kernel source code to disk")
+                save_ok = False
 
             # ... similar for other saves ...
 
+            if not save_ok:
+                shutil.rmtree(staging_path, ignore_errors=True)
+                return
+
             # Atomic rename — makes the complete directory visible in one step.
             os.rename(staging_path, cache_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/cache/kernel_cache.py` around lines 384 - 413, The staging directory
is being renamed to the final cache_path even if some inner saves fail; update
KernelCache.save (the block around _save_kernel_source_code_to_disk,
_save_wrapper_kernel_code_to_disk, _save_so_cubin_to_disk and
KernelCache._safe_write_file) to track success for each critical write (e.g.,
set an all_succeeded flag, flip to False on any Exception), and only perform
os.rename(staging_path, cache_path) when all_succeeded is True; if not all
succeeded, log a clear error and remove the incomplete staging directory
(shutil.rmtree or equivalent) to avoid persisting partial cache entries and
avoid committing partial state that _load_kernel_from_disk/_build_kernel would
later revisit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tilelang/autotuner/param.py`:
- Around line 377-383: The staging directory for the autotuner is created under
parent (in param.py within the staging_path creation block using Path(path).name
and os.getpid/uuid) which places it under TILELANG_CACHE_DIR/autotuner, but
KernelCache._cleanup_stale_staging_dirs() only scans env.TILELANG_CACHE_DIR; fix
by either creating staging dirs directly under env.TILELANG_CACHE_DIR in the
staging_path logic (replace parent with env.TILELANG_CACHE_DIR) or modify
KernelCache._cleanup_stale_staging_dirs() to also scan the autotuner
subdirectory (recursively or explicitly include TILELANG_CACHE_DIR /
"autotuner") so orphaned .staging_* directories created by the autotuner are
discovered and removed.

---

Nitpick comments:
In `@tilelang/autotuner/param.py`:
- Around line 430-435: The current exception handling around the atomic autotune
result save swallows non-OSError exceptions after cleaning up staging_path and
logging via logger.exception("Error during atomic autotune result save"); update
the except Exception block in the save routine to perform the same cleanup
(shutil.rmtree(str(staging_path), ignore_errors=True)), call logger.exception
with the existing message, and then re-raise the exception so callers (callers
of the atomic save function that uses staging_path) can detect the failure;
reference the staging_path variable and the logger.exception call when locating
the try/except to modify.

In `@tilelang/cache/kernel_cache.py`:
- Around line 384-413: The staging directory is being renamed to the final
cache_path even if some inner saves fail; update KernelCache.save (the block
around _save_kernel_source_code_to_disk, _save_wrapper_kernel_code_to_disk,
_save_so_cubin_to_disk and KernelCache._safe_write_file) to track success for
each critical write (e.g., set an all_succeeded flag, flip to False on any
Exception), and only perform os.rename(staging_path, cache_path) when
all_succeeded is True; if not all succeeded, log a clear error and remove the
incomplete staging directory (shutil.rmtree or equivalent) to avoid persisting
partial cache entries and avoid committing partial state that
_load_kernel_from_disk/_build_kernel would later revisit.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4881e1de-890a-4c7c-8e76-58b98e3f8e86

📥 Commits

Reviewing files that changed from the base of the PR and between dc60ab8 and 08976a2.

📒 Files selected for processing (2)
  • tilelang/autotuner/param.py
  • tilelang/cache/kernel_cache.py

Comment on lines +377 to +383
# Already saved (e.g. another process won the race).
if os.path.isdir(path):
return

# save function (atomic)
if verbose:
logger.debug(f"Saving function to file: {path / FUNCTION_PATH}")
self._safe_write_file(str(path / FUNCTION_PATH), "wb", lambda f: cloudpickle.dump(self.func, f))
parent = path.parent if isinstance(path, Path) else Path(path).parent
staging_path = parent / f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}"
os.makedirs(staging_path)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how save_to_disk is typically called to understand if path is under TILELANG_CACHE_DIR
rg -n "save_to_disk" --type=py -C5 | head -100

# Check if there's any cleanup mechanism for param.py staging dirs
rg -n "_cleanup_stale_staging" --type=py

Repository: tile-ai/tilelang

Length of output: 2769


🏁 Script executed:

# Check Tuner class initialization to see how cache_dir is set
rg -n "class Tuner" --type=py -A20 | head -50

# Get the full implementation of _cleanup_stale_staging_dirs
rg -n "_cleanup_stale_staging_dirs" --type=py -A20

# Check the full save_to_disk method to understand staging path logic
rg -n "def save_to_disk" --type=py -A30 tilelang/autotuner/param.py

Repository: tile-ai/tilelang

Length of output: 3283


🏁 Script executed:

# Find Tuner class __init__ and cache_dir initialization
rg -n "class Tuner" --type=py -A40 tilelang/autotuner/tuner.py | head -80

# Also search for any other calls to save_to_disk to see what paths are used
rg -n "save_to_disk" --type=py -B2 -A2

Repository: tile-ai/tilelang

Length of output: 1138


🏁 Script executed:

# Get Tuner class __init__ more directly
sed -n '1,200p' tilelang/autotuner/tuner.py | grep -n -A30 "def __init__"

# Or search for cache_dir assignment in tuner.py
rg -n "self.cache_dir" --type=py tilelang/autotuner/tuner.py

Repository: tile-ai/tilelang

Length of output: 1431


🏁 Script executed:

# Search for cache_dir assignment
rg -n "cache_dir\s*=" --type=py tilelang/autotuner/tuner.py

# Get full Tuner class definition
rg -n "^class Tuner" --type=py -A100 tilelang/autotuner/tuner.py | head -150

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

# Get context around line 309 to find which class it belongs to
sed -n '200,350p' tilelang/autotuner/tuner.py | grep -n -B50 "save_to_disk" | head -80

# Search for cache_dir definition anywhere
rg -n "cache_dir" --type=py tilelang/autotuner/ -B2 -A2

Repository: tile-ai/tilelang

Length of output: 3440


🏁 Script executed:

# Check if there's any other cleanup mechanism for autotuner staging dirs
rg -n "staging" --type=py tilelang/autotuner/

# Also check if there's any cleanup in AutotuneResult class
rg -n "cleanup|rmtree|remove" --type=py tilelang/autotuner/param.py

Repository: tile-ai/tilelang

Length of output: 1938


Staging directory cleanup does not handle autotuner subdirectory.

The staging directory is created at parent / ".staging_..." where parent = (TILELANG_CACHE_DIR / "autotuner"), but KernelCache._cleanup_stale_staging_dirs() only scans env.TILELANG_CACHE_DIR directly (line 146 in kernel_cache.py). Orphaned staging directories left by crashed processes in TILELANG_CACHE_DIR/autotuner/.staging_* will never be cleaned up.

Consider either:

  1. Place staging directories directly under env.TILELANG_CACHE_DIR instead of in the autotuner subdirectory, or
  2. Update _cleanup_stale_staging_dirs() to recursively scan subdirectories (or add a separate cleanup call specific to the autotuner directory).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/autotuner/param.py` around lines 377 - 383, The staging directory
for the autotuner is created under parent (in param.py within the staging_path
creation block using Path(path).name and os.getpid/uuid) which places it under
TILELANG_CACHE_DIR/autotuner, but KernelCache._cleanup_stale_staging_dirs() only
scans env.TILELANG_CACHE_DIR; fix by either creating staging dirs directly under
env.TILELANG_CACHE_DIR in the staging_path logic (replace parent with
env.TILELANG_CACHE_DIR) or modify KernelCache._cleanup_stale_staging_dirs() to
also scan the autotuner subdirectory (recursively or explicitly include
TILELANG_CACHE_DIR / "autotuner") so orphaned .staging_* directories created by
the autotuner are discovered and removed.

…dir-rename

# Conflicts:
#	tilelang/autotuner/param.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tilelang/autotuner/param.py (1)

411-413: ⚠️ Potential issue | 🟠 Major

Staging path still bypasses existing stale-staging cleanup.

Lines 411-413 create .staging_* under path.parent (e.g., autotuner subdir), while the current cleanup in tilelang/cache/kernel_cache.py scans only env.TILELANG_CACHE_DIR root. Crashed writers can still leave orphan staging dirs here.

💡 Suggested adjustment (reuse existing cleanup scope)
-        parent = path.parent if isinstance(path, Path) else Path(path).parent
-        staging_path = parent / f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}"
-        os.makedirs(staging_path)
+        cache_root = Path(env.TILELANG_CACHE_DIR)
+        staging_path = cache_root / f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}"
+        os.makedirs(staging_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/autotuner/param.py` around lines 411 - 413, The staging directory is
being created under parent (via parent / f".staging_{Path(path).name}_...")
which bypasses the global stale-staging cleanup in
tilelang/cache/kernel_cache.py that only scans env.TILELANG_CACHE_DIR; change
the staging root to the shared cache root so orphaned staging dirs are cleaned:
build staging_path from env.TILELANG_CACHE_DIR (e.g., env.TILELANG_CACHE_DIR /
f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}") instead of
parent, or alternatively invoke the kernel_cache cleanup helper before writing;
update the code that computes staging_path (reference symbols: staging_path,
parent, Path(path).name) to use env.TILELANG_CACHE_DIR so existing cleanup scope
is reused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tilelang/autotuner/param.py`:
- Around line 462-467: The current OSError except block around the atomic
autotune result save is too broad: change the except OSError to except OSError
as e and only treat it as the benign race if e.errno is errno.EEXIST or
errno.ENOTEMPTY (call shutil.rmtree(str(staging_path), ignore_errors=True) in
that case); for any other errno, log the error with logger.exception (including
context about staging_path / atomic autotune result save) and re-raise the
exception so real filesystem errors aren’t silently swallowed. Ensure you import
errno if not present and keep the existing generic Exception handler for other
errors.

---

Duplicate comments:
In `@tilelang/autotuner/param.py`:
- Around line 411-413: The staging directory is being created under parent (via
parent / f".staging_{Path(path).name}_...") which bypasses the global
stale-staging cleanup in tilelang/cache/kernel_cache.py that only scans
env.TILELANG_CACHE_DIR; change the staging root to the shared cache root so
orphaned staging dirs are cleaned: build staging_path from
env.TILELANG_CACHE_DIR (e.g., env.TILELANG_CACHE_DIR /
f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}") instead of
parent, or alternatively invoke the kernel_cache cleanup helper before writing;
update the code that computes staging_path (reference symbols: staging_path,
parent, Path(path).name) to use env.TILELANG_CACHE_DIR so existing cleanup scope
is reused.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 27603822-4784-4aee-9e1b-3a0b3021ecff

📥 Commits

Reviewing files that changed from the base of the PR and between 08976a2 and d76c1c1.

📒 Files selected for processing (1)
  • tilelang/autotuner/param.py

Comment on lines +462 to +467
except OSError:
# Race: another process already created the final path.
shutil.rmtree(str(staging_path), ignore_errors=True)
except Exception:
shutil.rmtree(str(staging_path), ignore_errors=True)
logger.exception("Error during atomic autotune result save")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify atomic rename handlers and OSError treatment in relevant modules.
rg -n --type=py -C5 'os\.rename\(|except OSError' tilelang/autotuner/param.py tilelang/cache/kernel_cache.py

Repository: tile-ai/tilelang

Length of output: 3040


🏁 Script executed:

#!/bin/bash
# Check imports and full context in tilelang/autotuner/param.py
head -50 tilelang/autotuner/param.py | cat -n

Repository: tile-ai/tilelang

Length of output: 2047


🏁 Script executed:

#!/bin/bash
# Check if errno is already imported
rg -n "^import errno|^from errno" tilelang/autotuner/param.py

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Get full context around the problematic code (lines 450-475)
sed -n '450,475p' tilelang/autotuner/param.py | cat -n

Repository: tile-ai/tilelang

Length of output: 1244


🏁 Script executed:

#!/bin/bash
# Check the imports section more comprehensively
head -100 tilelang/autotuner/param.py | grep -n "^import\|^from" | head -30

Repository: tile-ai/tilelang

Length of output: 604


🏁 Script executed:

#!/bin/bash
# Search for tests related to save_to_disk or AutotuneResult
fd -type f -name "*.py" | xargs grep -l "save_to_disk\|AutotuneResult" | head -10

Repository: tile-ai/tilelang

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Check if there are any existing errno checks in the codebase
rg -n "errno\." tilelang/ --type=py | head -20

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Look for any exception handling of EEXIST or race conditions
rg -n "EEXIST|ENOTEMPTY|FileExistsError" tilelang/ --type=py

Repository: tile-ai/tilelang

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Check for all except OSError patterns in param.py
rg -n "except OSError" tilelang/autotuner/param.py

Repository: tile-ai/tilelang

Length of output: 87


🏁 Script executed:

#!/bin/bash
# Get full context of the kernel_cache.py error handling (lines 408-420)
sed -n '408,420p' tilelang/cache/kernel_cache.py | cat -n

Repository: tile-ai/tilelang

Length of output: 826


🏁 Script executed:

#!/bin/bash
# Check if kernel_cache.py also has errno imported
head -50 tilelang/cache/kernel_cache.py | grep -n "^import\|^from"

Repository: tile-ai/tilelang

Length of output: 646


Narrow OSError handling to specific race condition errno values.

Line 462 catches all OSError as if it were a race condition, which silently masks real filesystem errors (permissions, cross-device links, read-only filesystem) and drops cache writes without logging.

The race condition on os.rename() only occurs when the destination already exists (errno.EEXIST or errno.ENOTEMPTY). Other OS-level failures should be logged and re-raised.

🔧 Narrow race handling to expected errno values
+import errno
...
-        except OSError:
-            # Race: another process already created the final path.
-            shutil.rmtree(str(staging_path), ignore_errors=True)
+        except OSError as e:
+            shutil.rmtree(str(staging_path), ignore_errors=True)
+            # Expected race only when destination was created concurrently.
+            if e.errno not in (errno.EEXIST, errno.ENOTEMPTY):
+                logger.exception("Error during atomic autotune result save")
+                raise
🧰 Tools
🪛 Ruff (0.15.7)

[warning] 465-465: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tilelang/autotuner/param.py` around lines 462 - 467, The current OSError
except block around the atomic autotune result save is too broad: change the
except OSError to except OSError as e and only treat it as the benign race if
e.errno is errno.EEXIST or errno.ENOTEMPTY (call
shutil.rmtree(str(staging_path), ignore_errors=True) in that case); for any
other errno, log the error with logger.exception (including context about
staging_path / atomic autotune result save) and re-raise the exception so real
filesystem errors aren’t silently swallowed. Ensure you import errno if not
present and keep the existing generic Exception handler for other errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants