Skip to content

feat: Round-up file-sizes by block-size in file-cache (opt-out by experimental-file-cache-disable-size-calculation-fix)#4514

Merged
gargnitingoogle merged 27 commits intomasterfrom
gargnitin/cache-dir-size-fix/phase1/v1
Mar 27, 2026
Merged

feat: Round-up file-sizes by block-size in file-cache (opt-out by experimental-file-cache-disable-size-calculation-fix)#4514
gargnitingoogle merged 27 commits intomasterfrom
gargnitin/cache-dir-size-fix/phase1/v1

Conversation

@gargnitingoogle
Copy link
Copy Markdown
Contributor

@gargnitingoogle gargnitingoogle commented Mar 23, 2026

Description

  • This fixes the accounting of the sizes of the cached files in the cache-dir, by rounding them up to the block-size of the cache-dir. This is the first phase of the fix needed to contain the actual disk-utilization of cache-dir at the specified limit of file-cache-max-size-mb.
  • Adds new experimental hidden flag new hidden flag
    experimental-file-cache-disable-size-calculation-fix to disable the fix, if needed as a fallback for unexpected behavior change.

Note: This fix makes no differences to the cases where file-cache-max-size-mb is specified as -1, which is the default value for that flag. So, this fix doesn't change anything unless file-cache-max-size-mb is explicitly set to a value other than -1.

Note: This fix doesn't completely ensure that the actual disk-size of cache-dir doesn't go beyond file-cache-max-size-mb as it doesn't address the space taken up by directories in cache-dir. It also doesn't clear up empty directories in cache-dir. For that, more changes are planned in further phases.

Note: This fix doesn't apply when the cache is sparse-enabled (i.e. file-cache-experimental-enable-chunk-cache=true), as the file-size rounding-up is not needed for them, so this fix is disabled for that case.

Link to the issue in case of a bug fix.

b/496408968

Testing details

  1. Manual - Tested locally with file-cache-max-size-mb set to 1 and number of directories ~= number of files in the read workload. With that set up, I let a multi-thread read application run for about 2 minutes. Saw the effect of the change.
    • Without the fix
      • the size of cache-dir (output of du -sh <cache-dir>) grew up to 4.5 MB
      • Number of files in cache-dir grew to about 554 files.
    • With the fix
      • the size of cache-dir grew to and stayed at about 2.5 MB (because of 1.5 MB taken by directories)
      • the number of files in the cache-dir grew to and stayed at 256 (all small files in the test set-up), which makes sense because 4k * 256 = 1 MB.
  2. Unit tests - Added/updated unit tests for the changes.
  3. Integration tests - Ran as part of presubmit.
    • e2e tests for non-zb:
    • e2e test with zb:
    • perf run:
      +--------+------------+--------------+------------+--------------+--------------+
      | Branch | File Size  |   Read BW    |  Write BW  | RandRead BW  | RandWrite BW |
      +--------+------------+--------------+------------+--------------+--------------+
      | Master |  0.25MiB   | 558.53MiB/s  |  1.3MiB/s  |  80.15MiB/s  |  1.11MiB/s   |
      |   PR   |  0.25MiB   | 576.11MiB/s  | 1.05MiB/s  |  82.28MiB/s  |  1.27MiB/s   |
      |        |            |              |            |              |              |
      |        |            |              |            |              |              |
      | Master | 48.828MiB  | 4410.43MiB/s | 78.73MiB/s | 1546.12MiB/s |  79.43MiB/s  |
      |   PR   | 48.828MiB  | 4450.18MiB/s | 77.0MiB/s  | 1567.94MiB/s |  78.95MiB/s  |
      |        |            |              |            |              |              |
      |        |            |              |            |              |              |
      | Master | 976.562MiB | 4432.69MiB/s | 35.19MiB/s | 765.57MiB/s  |  38.76MiB/s  |
      |   PR   | 976.562MiB | 4429.37MiB/s | 35.21MiB/s | 1089.37MiB/s |  39.1MiB/s   |
      |        |            |              |            |              |              |
      |        |            |              |            |              |              |
      +--------+------------+--------------+------------+--------------+--------------+
    • Scale tests
      • Checkpointing benchmark
        • With fix: Run in progress. Will add data when available.

Any backward incompatible change? If so, please explain.

The fix can potentially change the user experience indirectly, as the cache-dir disk-utilization will now be closer to file-cache-max-size-mb, and it can reduce how many files are actually cached, if there are many small files.

@gargnitingoogle gargnitingoogle added the execute-integration-tests Run only integration tests label Mar 23, 2026
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/cache-dir-size-fix/phase1/v1 branch from e9f7409 to c2ebcf2 Compare March 23, 2026 04:45
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 23, 2026

Codecov Report

❌ Patch coverage is 91.30435% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.65%. Comparing base (29c57be) to head (50e7f18).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
internal/util/diskutil/disk_util.go 70.00% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4514      +/-   ##
==========================================
+ Coverage   83.63%   83.65%   +0.01%     
==========================================
  Files         163      164       +1     
  Lines       20150    20185      +35     
==========================================
+ Hits        16853    16885      +32     
- Misses       2661     2663       +2     
- Partials      636      637       +1     
Flag Coverage Δ
unittests 83.65% <91.30%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gargnitingoogle
Copy link
Copy Markdown
Contributor Author

/gemini review

@gargnitingoogle
Copy link
Copy Markdown
Contributor Author

/gemini summary

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for file cache size accounting by considering the filesystem block size, enabled by default via a new experimental flag. The implementation extends the LRU cache to support custom size calculation functions and adds new utility functions to determine disk usage. The changes are well-structured. However, there is a critical bug in the LRU cache implementation when updating existing entries, which could lead to incorrect cache size management. Please see the specific comment for details.

Comment thread internal/cache/lru/lru.go Outdated
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

This pull request introduces a significant improvement to how file cache sizes are accounted for, ensuring that the reported cache usage more accurately reflects the actual disk space consumed. By rounding up cached file sizes to the nearest filesystem block size, the system can better adhere to configured cache limits. An experimental flag has been added to provide flexibility in controlling this new behavior, which is a foundational step towards more robust cache management.

Highlights

  • File Cache Size Accounting Fix: Implemented a fix to accurately account for cached file sizes by rounding them up to the filesystem's block size, addressing discrepancies between reported and actual disk usage. This is the first phase in ensuring the file-cache-max-size-mb limit is honored.
  • Experimental Flag for Disabling Fix: Introduced a new experimental and hidden flag, experimental-file-cache-disable-size-calculation-fix, allowing users to disable this new block-size-aware accounting if needed. The fix is enabled by default when file-cache-max-size-mb is explicitly set.
  • LRU Cache Refactoring for Custom Size Calculation: Refactored the Least Recently Used (LRU) cache to support custom size calculation functions. This enables the cache to use the new block-level accounting for its entries, providing more precise capacity management.
  • Disk Utility Functions: Added new utility functions to calculate the speculative disk space a file will consume based on block size and to retrieve the actual block size of a given volume.
Changelog
  • cfg/config.go
    • Added ExperimentalDisableSizeCalculationFix field to the FileCacheConfig struct.
    • Registered the experimental-file-cache-disable-size-calculation-fix flag in BuildFlagSet.
    • Bound the new flag to its configuration path in BindFlags.
  • cfg/params.yaml
    • Defined the new experimental-file-cache-disable-size-calculation-fix flag with its type, usage, default value, and hidden status.
  • cmd/config_validation_test.go
    • Updated defaultFileCacheConfig to include the ExperimentalDisableSizeCalculationFix field.
    • Modified TestValidateConfigFile_FileCacheConfigSuccessful to incorporate the new experimental flag in test configurations.
  • cmd/root_test.go
    • Adjusted TestArgsParsing_FileCacheFlags to include the ExperimentalDisableSizeCalculationFix field in file cache configurations.
  • internal/cache/file/cache_handler.go
    • Imported the new baseutil package.
    • Added a sizeCalcFix boolean field to the CacheHandler struct.
    • Modified the NewCacheHandler constructor to accept a sizeCalcFix parameter.
    • Conditionally set a custom size calculation function on the fileInfoCache based on the sizeCalcFix value, using GetVolumeBlockSize and GetSpeculativeFileSizeOnDisk.
  • internal/cache/file/cache_handler_test.go
    • Updated the NewCacheHandler call in initializeCacheHandlerTestArgs to pass false for the new sizeCalcFix parameter.
  • internal/cache/lru/lru.go
    • Added a sizeCalcFunc field to the Cache struct to hold a custom size calculation function.
    • Introduced defaultSizeCalc for standard size calculation.
    • Refactored NewCache to call the new NewCacheWithCustomSizeCalc.
    • Added NewCacheWithCustomSizeCalc to allow initializing the cache with a custom size calculation function.
    • Updated evictOne, Insert, Erase, and UpdateWithoutChangingOrder methods to use c.sizeCalcFunc for size accounting.
    • Added a SetSizeCalcFunc method to dynamically change the size calculation logic.
  • internal/fs/fs.go
    • Modified createSingleMountFileCacheHandler to pass the inverse of serverCfg.NewConfig.FileCache.ExperimentalDisableSizeCalculationFix to the NewCacheHandler.
  • internal/gcsx/file_cache_reader_test.go
    • Updated the NewCacheHandler call in SetupTest to include the new sizeCalcFix parameter, set to false.
  • internal/gcsx/random_reader_stretchr_test.go
    • Modified the NewCacheHandler call in SetupTest to include the new sizeCalcFix parameter, set to false.
  • internal/gcsx/random_reader_test.go
    • Adjusted the NewCacheHandler call in SetUp to include the new sizeCalcFix parameter, set to false.
  • internal/gcsx/read_manager/read_manager_test.go
    • Updated the NewCacheHandler call in readManagerConfig to include the new sizeCalcFix parameter, set to false.
  • internal/util/disk_util.go
    • Added GetSpeculativeFileSizeOnDisk function to calculate disk space based on file content size and volume block size.
    • Added GetVolumeBlockSize function to retrieve the filesystem block size for a given path.
  • internal/util/disk_util_test.go
    • Added unit tests for GetSpeculativeFileSizeOnDisk covering various scenarios.
    • Added a unit test for GetVolumeBlockSize to verify its functionality.
Activity
  • Codecov reported 84.21% patch coverage, noting 6 lines missing coverage in internal/cache/file/cache_handler.go and internal/util/disk_util.go.
  • The author, gargnitingoogle, requested a Gemini review.
  • A critical review comment from gemini-code-assist[bot] identified a bug in internal/cache/lru/lru.go within the Insert function, where the old entry's size was subtracted without using the custom size calculation function, leading to incorrect cache accounting. A suggested fix was provided.

@gargnitingoogle gargnitingoogle added execute-perf-test Execute performance test in PR and removed execute-integration-tests Run only integration tests labels Mar 23, 2026
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/cache-dir-size-fix/phase1/v1 branch from c2ebcf2 to a513656 Compare March 23, 2026 05:43
@gargnitingoogle gargnitingoogle added execute-integration-tests Run only integration tests and removed execute-perf-test Execute performance test in PR labels Mar 23, 2026
@gargnitingoogle gargnitingoogle marked this pull request as ready for review March 23, 2026 07:30
@gargnitingoogle gargnitingoogle requested a review from a team as a code owner March 23, 2026 07:30
@github-actions github-actions Bot added the remind-reviewers Auto remind reviewers in attention set for review post 24hrs of inactivity on PR. label Mar 23, 2026
@gargnitingoogle gargnitingoogle requested review from kislaykishore and removed request for abhishek10004 March 23, 2026 07:30
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for file cache size accounting by considering the filesystem block size. It adds a new experimental flag to disable this behavior. The changes involve modifying the LRU cache to support custom size calculation functions and adding utility functions to get disk properties. My review focuses on a potential bug in the new LRU cache API, adherence to the project's code generation process, and general code quality improvements such as removing magic numbers and strengthening tests.

Comment thread cfg/config.go
Comment thread internal/cache/lru/lru.go Outdated
Comment thread internal/cache/file/cache_handler.go Outdated
Comment thread internal/util/disk_util_test.go Outdated
@gargnitingoogle gargnitingoogle marked this pull request as draft March 23, 2026 08:30
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/cache-dir-size-fix/phase1/v1 branch 2 times, most recently from 059e6f0 to 5d4f2ce Compare March 23, 2026 10:36
@gargnitingoogle gargnitingoogle changed the title feat: Add experimental-file-cache-disable-size-calculation-fix feat: Round-up file-sizes by block-size in file-cache (opt-out by experimental-file-cache-disable-size-calculation-fix) Mar 23, 2026
@gargnitingoogle
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix for file cache size accounting by rounding up file sizes to the filesystem's block size, providing a more accurate measure of disk utilization. This is controlled by a new experimental flag, experimental-file-cache-disable-size-calculation-fix, which is off by default (meaning the fix is active). The implementation involves significant and well-executed modifications to the LRU cache to support custom size calculation functions, along with new utility functions for disk properties. The changes are accompanied by thorough unit and integration tests.

My main feedback is regarding a confusing log message in internal/cache/file/cache_handler.go that incorrectly states the reason for the size calculation fix being disabled. I've provided a suggestion to correct the logic and improve clarity for users.

Comment thread internal/cache/file/cache_handler.go Outdated
@gargnitingoogle gargnitingoogle marked this pull request as ready for review March 23, 2026 13:12
@gargnitingoogle gargnitingoogle added execute-integration-tests-on-zb To run E2E tests on zonal bucket. and removed execute-integration-tests Run only integration tests labels Mar 23, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a valuable fix for file cache size accounting by rounding up file sizes to the filesystem's block size, leading to more accurate cache size management. The implementation is robust, introducing a flexible sizing mechanism in the LRU cache and encapsulating disk utility functions cleanly. The changes are well-supported by comprehensive unit tests, and the new configuration flag is correctly integrated. I have identified a minor issue with the copyright year in the newly added files, and have provided guidance on how to address it, especially if the files are auto-generated.

Comment thread internal/util/diskutil/disk_util.go
Comment thread internal/util/diskutil/disk_util_test.go
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/cache-dir-size-fix/phase1/v1 branch from 39049d4 to f6af215 Compare March 24, 2026 04:54
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/cache-dir-size-fix/phase1/v1 branch from 146c6cb to a2d4efa Compare March 26, 2026 17:21
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/cache-dir-size-fix/phase1/v1 branch from a2d4efa to c75ddfc Compare March 26, 2026 17:25
Comment thread internal/util/diskutil/disk_util.go Outdated
@gargnitingoogle gargnitingoogle force-pushed the gargnitin/cache-dir-size-fix/phase1/v1 branch from 663608f to 8bdec39 Compare March 27, 2026 04:47
Comment thread internal/util/diskutil/disk_util.go
Comment thread internal/util/diskutil/disk_util_test.go Outdated
@gargnitingoogle gargnitingoogle added execute-integration-tests Run only integration tests and removed execute-integration-tests Run only integration tests labels Mar 27, 2026
@gargnitingoogle gargnitingoogle removed the execute-integration-tests Run only integration tests label Mar 27, 2026
@gargnitingoogle gargnitingoogle merged commit 582a220 into master Mar 27, 2026
21 of 23 checks passed
@gargnitingoogle gargnitingoogle deleted the gargnitin/cache-dir-size-fix/phase1/v1 branch March 27, 2026 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

remind-reviewers Auto remind reviewers in attention set for review post 24hrs of inactivity on PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants