[Enhancement] Use atomic directory rename for cache writes#1982
[Enhancement] Use atomic directory rename for cache writes#1982LeiWang1999 wants to merge 2 commits intomainfrom
Conversation
Replace per-file os.replace with write-to-staging + os.rename to ensure other processes never see an incomplete cache directory. Also add cleanup of stale staging dirs on init.
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
📝 WalkthroughWalkthroughBoth modules now write kernel/artifact outputs into unique per-call staging directories and atomically publish them by renaming the staging directory to the final path; stale staging dirs are cleaned up and rename races are handled with targeted cleanup and logging. (47 words) Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Autotuner
participant KernelCache
participant Filesystem
Autotuner->>Filesystem: check if final `path` exists
alt final path exists
Filesystem-->>Autotuner: exists -> fast return
else
Autotuner->>Filesystem: create unique staging dir `.staging_<uuid>` under cache/root
Autotuner->>Filesystem: write artifacts (best_config, function, out_idx, latency, kernel files) into staging
Autotuner->>KernelCache: invoke publish (rename staging -> final path)
KernelCache->>Filesystem: attempt atomic rename(staging, final)
alt rename success
Filesystem-->>KernelCache: renamed
KernelCache-->>Autotuner: publish complete
else rename OSError (race)
Filesystem-->>KernelCache: OSError
KernelCache->>Filesystem: remove staging dir
KernelCache-->>Autotuner: treat as published by other process
else other Exception
KernelCache->>Filesystem: remove staging dir
KernelCache-->>Autotuner: log exception and fail
end
end
Note right of KernelCache: periodic cleanup invoked during init:\nremove `.staging_...` older than TTL
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tilelang/autotuner/param.py (1)
430-435: Error handling pattern is reasonable.The
OSErrorcatch handles the benign race where another process created the final path. The broadExceptioncatch (flagged by Ruff) is acceptable here since it ensures cleanup and logs the actual exception vialogger.exception(). However, consider re-raising after cleanup if callers need to know the save failed:♻️ Optional: re-raise after logging
except Exception: shutil.rmtree(str(staging_path), ignore_errors=True) logger.exception("Error during atomic autotune result save") + raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/autotuner/param.py` around lines 430 - 435, The current exception handling around the atomic autotune result save swallows non-OSError exceptions after cleaning up staging_path and logging via logger.exception("Error during atomic autotune result save"); update the except Exception block in the save routine to perform the same cleanup (shutil.rmtree(str(staging_path), ignore_errors=True)), call logger.exception with the existing message, and then re-raise the exception so callers (callers of the atomic save function that uses staging_path) can detect the failure; reference the staging_path variable and the logger.exception call when locating the try/except to modify.tilelang/cache/kernel_cache.py (1)
384-413: Partial cache entries can be committed if individual saves fail.The inner
try-exceptblocks log errors but allow execution to continue to theos.rename. If, for example,_save_kernel_source_code_to_diskfails but other saves succeed, an incomplete staging directory is still atomically renamed to the final path.While
_load_kernel_from_diskand_build_kernelhandle missing components gracefully (returningNone), this means:
- Incomplete cache entries occupy disk space permanently
- Every subsequent load attempt must re-check and fail
Consider tracking save success and only renaming if all critical files were written:
♻️ Optional: gate rename on successful saves
+ save_ok = True try: # Save kernel source code try: self._save_kernel_source_code_to_disk(kernel, staging_path, verbose) except Exception: self.logger.exception("Error saving kernel source code to disk") + save_ok = False # ... similar for other saves ... + if not save_ok: + shutil.rmtree(staging_path, ignore_errors=True) + return + # Atomic rename — makes the complete directory visible in one step. os.rename(staging_path, cache_path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/cache/kernel_cache.py` around lines 384 - 413, The staging directory is being renamed to the final cache_path even if some inner saves fail; update KernelCache.save (the block around _save_kernel_source_code_to_disk, _save_wrapper_kernel_code_to_disk, _save_so_cubin_to_disk and KernelCache._safe_write_file) to track success for each critical write (e.g., set an all_succeeded flag, flip to False on any Exception), and only perform os.rename(staging_path, cache_path) when all_succeeded is True; if not all succeeded, log a clear error and remove the incomplete staging directory (shutil.rmtree or equivalent) to avoid persisting partial cache entries and avoid committing partial state that _load_kernel_from_disk/_build_kernel would later revisit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tilelang/autotuner/param.py`:
- Around line 377-383: The staging directory for the autotuner is created under
parent (in param.py within the staging_path creation block using Path(path).name
and os.getpid/uuid) which places it under TILELANG_CACHE_DIR/autotuner, but
KernelCache._cleanup_stale_staging_dirs() only scans env.TILELANG_CACHE_DIR; fix
by either creating staging dirs directly under env.TILELANG_CACHE_DIR in the
staging_path logic (replace parent with env.TILELANG_CACHE_DIR) or modify
KernelCache._cleanup_stale_staging_dirs() to also scan the autotuner
subdirectory (recursively or explicitly include TILELANG_CACHE_DIR /
"autotuner") so orphaned .staging_* directories created by the autotuner are
discovered and removed.
---
Nitpick comments:
In `@tilelang/autotuner/param.py`:
- Around line 430-435: The current exception handling around the atomic autotune
result save swallows non-OSError exceptions after cleaning up staging_path and
logging via logger.exception("Error during atomic autotune result save"); update
the except Exception block in the save routine to perform the same cleanup
(shutil.rmtree(str(staging_path), ignore_errors=True)), call logger.exception
with the existing message, and then re-raise the exception so callers (callers
of the atomic save function that uses staging_path) can detect the failure;
reference the staging_path variable and the logger.exception call when locating
the try/except to modify.
In `@tilelang/cache/kernel_cache.py`:
- Around line 384-413: The staging directory is being renamed to the final
cache_path even if some inner saves fail; update KernelCache.save (the block
around _save_kernel_source_code_to_disk, _save_wrapper_kernel_code_to_disk,
_save_so_cubin_to_disk and KernelCache._safe_write_file) to track success for
each critical write (e.g., set an all_succeeded flag, flip to False on any
Exception), and only perform os.rename(staging_path, cache_path) when
all_succeeded is True; if not all succeeded, log a clear error and remove the
incomplete staging directory (shutil.rmtree or equivalent) to avoid persisting
partial cache entries and avoid committing partial state that
_load_kernel_from_disk/_build_kernel would later revisit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4881e1de-890a-4c7c-8e76-58b98e3f8e86
📒 Files selected for processing (2)
tilelang/autotuner/param.pytilelang/cache/kernel_cache.py
| # Already saved (e.g. another process won the race). | ||
| if os.path.isdir(path): | ||
| return | ||
|
|
||
| # save function (atomic) | ||
| if verbose: | ||
| logger.debug(f"Saving function to file: {path / FUNCTION_PATH}") | ||
| self._safe_write_file(str(path / FUNCTION_PATH), "wb", lambda f: cloudpickle.dump(self.func, f)) | ||
| parent = path.parent if isinstance(path, Path) else Path(path).parent | ||
| staging_path = parent / f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}" | ||
| os.makedirs(staging_path) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check how save_to_disk is typically called to understand if path is under TILELANG_CACHE_DIR
rg -n "save_to_disk" --type=py -C5 | head -100
# Check if there's any cleanup mechanism for param.py staging dirs
rg -n "_cleanup_stale_staging" --type=pyRepository: tile-ai/tilelang
Length of output: 2769
🏁 Script executed:
# Check Tuner class initialization to see how cache_dir is set
rg -n "class Tuner" --type=py -A20 | head -50
# Get the full implementation of _cleanup_stale_staging_dirs
rg -n "_cleanup_stale_staging_dirs" --type=py -A20
# Check the full save_to_disk method to understand staging path logic
rg -n "def save_to_disk" --type=py -A30 tilelang/autotuner/param.pyRepository: tile-ai/tilelang
Length of output: 3283
🏁 Script executed:
# Find Tuner class __init__ and cache_dir initialization
rg -n "class Tuner" --type=py -A40 tilelang/autotuner/tuner.py | head -80
# Also search for any other calls to save_to_disk to see what paths are used
rg -n "save_to_disk" --type=py -B2 -A2Repository: tile-ai/tilelang
Length of output: 1138
🏁 Script executed:
# Get Tuner class __init__ more directly
sed -n '1,200p' tilelang/autotuner/tuner.py | grep -n -A30 "def __init__"
# Or search for cache_dir assignment in tuner.py
rg -n "self.cache_dir" --type=py tilelang/autotuner/tuner.pyRepository: tile-ai/tilelang
Length of output: 1431
🏁 Script executed:
# Search for cache_dir assignment
rg -n "cache_dir\s*=" --type=py tilelang/autotuner/tuner.py
# Get full Tuner class definition
rg -n "^class Tuner" --type=py -A100 tilelang/autotuner/tuner.py | head -150Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
# Get context around line 309 to find which class it belongs to
sed -n '200,350p' tilelang/autotuner/tuner.py | grep -n -B50 "save_to_disk" | head -80
# Search for cache_dir definition anywhere
rg -n "cache_dir" --type=py tilelang/autotuner/ -B2 -A2Repository: tile-ai/tilelang
Length of output: 3440
🏁 Script executed:
# Check if there's any other cleanup mechanism for autotuner staging dirs
rg -n "staging" --type=py tilelang/autotuner/
# Also check if there's any cleanup in AutotuneResult class
rg -n "cleanup|rmtree|remove" --type=py tilelang/autotuner/param.pyRepository: tile-ai/tilelang
Length of output: 1938
Staging directory cleanup does not handle autotuner subdirectory.
The staging directory is created at parent / ".staging_..." where parent = (TILELANG_CACHE_DIR / "autotuner"), but KernelCache._cleanup_stale_staging_dirs() only scans env.TILELANG_CACHE_DIR directly (line 146 in kernel_cache.py). Orphaned staging directories left by crashed processes in TILELANG_CACHE_DIR/autotuner/.staging_* will never be cleaned up.
Consider either:
- Place staging directories directly under
env.TILELANG_CACHE_DIRinstead of in the autotuner subdirectory, or - Update
_cleanup_stale_staging_dirs()to recursively scan subdirectories (or add a separate cleanup call specific to the autotuner directory).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tilelang/autotuner/param.py` around lines 377 - 383, The staging directory
for the autotuner is created under parent (in param.py within the staging_path
creation block using Path(path).name and os.getpid/uuid) which places it under
TILELANG_CACHE_DIR/autotuner, but KernelCache._cleanup_stale_staging_dirs() only
scans env.TILELANG_CACHE_DIR; fix by either creating staging dirs directly under
env.TILELANG_CACHE_DIR in the staging_path logic (replace parent with
env.TILELANG_CACHE_DIR) or modify KernelCache._cleanup_stale_staging_dirs() to
also scan the autotuner subdirectory (recursively or explicitly include
TILELANG_CACHE_DIR / "autotuner") so orphaned .staging_* directories created by
the autotuner are discovered and removed.
…dir-rename # Conflicts: # tilelang/autotuner/param.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
tilelang/autotuner/param.py (1)
411-413:⚠️ Potential issue | 🟠 MajorStaging path still bypasses existing stale-staging cleanup.
Lines 411-413 create
.staging_*underpath.parent(e.g., autotuner subdir), while the current cleanup intilelang/cache/kernel_cache.pyscans onlyenv.TILELANG_CACHE_DIRroot. Crashed writers can still leave orphan staging dirs here.💡 Suggested adjustment (reuse existing cleanup scope)
- parent = path.parent if isinstance(path, Path) else Path(path).parent - staging_path = parent / f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}" - os.makedirs(staging_path) + cache_root = Path(env.TILELANG_CACHE_DIR) + staging_path = cache_root / f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}" + os.makedirs(staging_path)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tilelang/autotuner/param.py` around lines 411 - 413, The staging directory is being created under parent (via parent / f".staging_{Path(path).name}_...") which bypasses the global stale-staging cleanup in tilelang/cache/kernel_cache.py that only scans env.TILELANG_CACHE_DIR; change the staging root to the shared cache root so orphaned staging dirs are cleaned: build staging_path from env.TILELANG_CACHE_DIR (e.g., env.TILELANG_CACHE_DIR / f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}") instead of parent, or alternatively invoke the kernel_cache cleanup helper before writing; update the code that computes staging_path (reference symbols: staging_path, parent, Path(path).name) to use env.TILELANG_CACHE_DIR so existing cleanup scope is reused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tilelang/autotuner/param.py`:
- Around line 462-467: The current OSError except block around the atomic
autotune result save is too broad: change the except OSError to except OSError
as e and only treat it as the benign race if e.errno is errno.EEXIST or
errno.ENOTEMPTY (call shutil.rmtree(str(staging_path), ignore_errors=True) in
that case); for any other errno, log the error with logger.exception (including
context about staging_path / atomic autotune result save) and re-raise the
exception so real filesystem errors aren’t silently swallowed. Ensure you import
errno if not present and keep the existing generic Exception handler for other
errors.
---
Duplicate comments:
In `@tilelang/autotuner/param.py`:
- Around line 411-413: The staging directory is being created under parent (via
parent / f".staging_{Path(path).name}_...") which bypasses the global
stale-staging cleanup in tilelang/cache/kernel_cache.py that only scans
env.TILELANG_CACHE_DIR; change the staging root to the shared cache root so
orphaned staging dirs are cleaned: build staging_path from
env.TILELANG_CACHE_DIR (e.g., env.TILELANG_CACHE_DIR /
f".staging_{Path(path).name}_{os.getpid()}_{uuid.uuid4().hex[:8]}") instead of
parent, or alternatively invoke the kernel_cache cleanup helper before writing;
update the code that computes staging_path (reference symbols: staging_path,
parent, Path(path).name) to use env.TILELANG_CACHE_DIR so existing cleanup scope
is reused.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 27603822-4784-4aee-9e1b-3a0b3021ecff
📒 Files selected for processing (1)
tilelang/autotuner/param.py
| except OSError: | ||
| # Race: another process already created the final path. | ||
| shutil.rmtree(str(staging_path), ignore_errors=True) | ||
| except Exception: | ||
| shutil.rmtree(str(staging_path), ignore_errors=True) | ||
| logger.exception("Error during atomic autotune result save") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify atomic rename handlers and OSError treatment in relevant modules.
rg -n --type=py -C5 'os\.rename\(|except OSError' tilelang/autotuner/param.py tilelang/cache/kernel_cache.pyRepository: tile-ai/tilelang
Length of output: 3040
🏁 Script executed:
#!/bin/bash
# Check imports and full context in tilelang/autotuner/param.py
head -50 tilelang/autotuner/param.py | cat -nRepository: tile-ai/tilelang
Length of output: 2047
🏁 Script executed:
#!/bin/bash
# Check if errno is already imported
rg -n "^import errno|^from errno" tilelang/autotuner/param.pyRepository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Get full context around the problematic code (lines 450-475)
sed -n '450,475p' tilelang/autotuner/param.py | cat -nRepository: tile-ai/tilelang
Length of output: 1244
🏁 Script executed:
#!/bin/bash
# Check the imports section more comprehensively
head -100 tilelang/autotuner/param.py | grep -n "^import\|^from" | head -30Repository: tile-ai/tilelang
Length of output: 604
🏁 Script executed:
#!/bin/bash
# Search for tests related to save_to_disk or AutotuneResult
fd -type f -name "*.py" | xargs grep -l "save_to_disk\|AutotuneResult" | head -10Repository: tile-ai/tilelang
Length of output: 230
🏁 Script executed:
#!/bin/bash
# Check if there are any existing errno checks in the codebase
rg -n "errno\." tilelang/ --type=py | head -20Repository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Look for any exception handling of EEXIST or race conditions
rg -n "EEXIST|ENOTEMPTY|FileExistsError" tilelang/ --type=pyRepository: tile-ai/tilelang
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Check for all except OSError patterns in param.py
rg -n "except OSError" tilelang/autotuner/param.pyRepository: tile-ai/tilelang
Length of output: 87
🏁 Script executed:
#!/bin/bash
# Get full context of the kernel_cache.py error handling (lines 408-420)
sed -n '408,420p' tilelang/cache/kernel_cache.py | cat -nRepository: tile-ai/tilelang
Length of output: 826
🏁 Script executed:
#!/bin/bash
# Check if kernel_cache.py also has errno imported
head -50 tilelang/cache/kernel_cache.py | grep -n "^import\|^from"Repository: tile-ai/tilelang
Length of output: 646
Narrow OSError handling to specific race condition errno values.
Line 462 catches all OSError as if it were a race condition, which silently masks real filesystem errors (permissions, cross-device links, read-only filesystem) and drops cache writes without logging.
The race condition on os.rename() only occurs when the destination already exists (errno.EEXIST or errno.ENOTEMPTY). Other OS-level failures should be logged and re-raised.
🔧 Narrow race handling to expected errno values
+import errno
...
- except OSError:
- # Race: another process already created the final path.
- shutil.rmtree(str(staging_path), ignore_errors=True)
+ except OSError as e:
+ shutil.rmtree(str(staging_path), ignore_errors=True)
+ # Expected race only when destination was created concurrently.
+ if e.errno not in (errno.EEXIST, errno.ENOTEMPTY):
+ logger.exception("Error during atomic autotune result save")
+ raise🧰 Tools
🪛 Ruff (0.15.7)
[warning] 465-465: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tilelang/autotuner/param.py` around lines 462 - 467, The current OSError
except block around the atomic autotune result save is too broad: change the
except OSError to except OSError as e and only treat it as the benign race if
e.errno is errno.EEXIST or errno.ENOTEMPTY (call
shutil.rmtree(str(staging_path), ignore_errors=True) in that case); for any
other errno, log the error with logger.exception (including context about
staging_path / atomic autotune result save) and re-raise the exception so real
filesystem errors aren’t silently swallowed. Ensure you import errno if not
present and keep the existing generic Exception handler for other errors.
Summary
os.replacewith write-to-staging +os.renamefor bothKernelCacheandAutotuneResult, ensuring other processes never observe an incomplete cache directory._cleanup_stale_staging_dirs()to remove orphaned staging directories (>1h old) left by crashed processes, called onKernelCachesingleton init._save_kernel_source_code_to_disk,_save_wrapper_kernel_code_to_disk,_save_so_cubin_to_disk) require no changes — they receive the staging path ascache_path.Test plan
tvm_fficache tests pass (pytest testing/python/cache/test_tilelang_kernel_cache.py -v).staging_*directories remain after successful writesSummary by CodeRabbit