Cache first failure building an overlay base DB to avoid repeated failures #3487
henrymercer wants to merge 23 commits into main
Use [...languages].sort() instead of languages.sort() to avoid mutating the caller's array as a side effect.
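A minimal sketch of the difference, for illustration only. The helper names `sortedInPlace` and `sortedCopy` are hypothetical and do not appear in the PR; the point is only that `Array.prototype.sort` reorders the array it is called on.

```typescript
// Hypothetical helpers, not taken from the PR.
function sortedInPlace(languages: string[]): string[] {
  // sort() reorders the array in place and returns that same array.
  return languages.sort();
}

function sortedCopy(languages: string[]): string[] {
  // Spreading first makes a shallow copy, so the caller's array is untouched.
  return [...languages].sort();
}

const a = ["python", "cpp", "java"];
const sortedA = sortedInPlace(a);
// a itself has been reordered, and sortedA is the same array object as a.

const b = ["python", "cpp", "java"];
const sortedB = sortedCopy(b);
// b keeps its original order; sortedB is a new, sorted array.
```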
Pull request overview
This pull request implements a caching mechanism to avoid repeated failures when overlay analysis (improved incremental analysis) fails on a runner due to insufficient resources, typically disk space. The PR introduces a status tracking system that records failures in the Actions cache, allowing subsequent runs to skip overlay analysis automatically until conditions change (e.g., runner upgrade or new CodeQL version).
Changes:
- Added a new `src/overlay/status.ts` module to track and persist overlay analysis failures via Actions cache
- Modified `src/init-action-post-helper.ts` to record failures when overlay base database builds are unsuccessful
- Updated `src/config-utils.ts` to check cached status and skip overlay analysis when previous failures are detected
- Added two new feature flags: `OverlayAnalysisStatusCheck` and `OverlayAnalysisStatusSave`
- Modified `bundleDb` function signature to accept an `includeDiagnostics` parameter
- Reorganized overlay-related imports from `overlay-database-utils` to `overlay` module structure
Reviewed changes
Copilot reviewed 29 out of 31 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/overlay/status.ts | New module implementing status persistence using Actions cache with timeout handling |
| src/overlay/status.test.ts | Comprehensive unit tests for the new status tracking functionality |
| src/init-action-post-helper.ts | Integration to save failure status after unsuccessful overlay-base builds |
| src/init-action-post-helper.test.ts | Tests verifying status saving under different conditions (failure, success, disabled) |
| src/config-utils.ts | Integration to check cached status and skip overlay analysis when indicated |
| src/config-utils.test.ts | Tests for skipping overlay analysis based on cached status |
| src/feature-flags.ts | Added two new feature flags for status check and save operations |
| src/util.ts | Modified bundleDb signature to accept includeDiagnostics parameter |
| src/database-upload.ts | Updated bundleDb call to pass includeDiagnostics: false |
| src/debug-artifacts.ts | Updated bundleDb call to pass includeDiagnostics: true |
| src/doc-url.ts | Added new documentation URL for deleting Actions cache entries |
| src/testing-utils.ts | Updated import path from overlay-database-utils to overlay |
| src/status-report.ts | Updated import path from overlay-database-utils to overlay |
| src/overlay/index.ts | Updated import paths to use relative paths from parent directory |
| src/overlay/index.test.ts | Updated imports to match new module structure |
| src/init-action.ts | Updated import path from overlay-database-utils to overlay |
| src/analyze.ts | Updated import path from overlay-database-utils to overlay |
| src/analyze-action.ts | Updated import path from overlay-database-utils to overlay |
| src/codeql.ts | Updated databaseBundle signature to include includeDiagnostics parameter |
| lib/* | Generated JavaScript files reflecting all TypeScript changes |
mbg left a comment
Thanks for putting this together! There's a lot going on here, including some things I like a lot. I have left a bunch of detailed comments, and there are also some high-level points:
- From previous experience, we know that we need to be careful about the construction of cache keys. We should consider more thoroughly what we need to include in the keys here to not shoot ourselves in the foot. In particular, I'd like us to better identify the runner and the analysis (thinking about Advanced Setup), and to guard against changes we need to make to the implementation.
- Caches can interact poorly with feature flags, if the feature flags affect what might be in a cache. We currently have a number of FFs which affect the overlay analysis behaviour and analysis in general. We might want to include these in the cache key so that e.g. we don't roll out a feature flag that breaks all overlay base database builds, then roll back the feature flag, but are stuck with caches that indicate failure for two weeks.
- The decision whether a base database build failed is currently local to a single workflow run. Consider a scenario where we successfully built an overlay base database with CodeQL version X and uploaded it. Now we are running again for a new commit, but building the overlay base database fails with the same CodeQL version -- perhaps due to an intermittent failure. We upload the status file and block all future base overlay db builds for this CodeQL version and all PR runs from even trying to download the existing base db, which is still in the cache. Perhaps it would be worth checking whether there is an existing base db for the same CodeQL version in the cache?
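To make the first two points concrete, here is a rough sketch of a status cache key that folds in a schema version, an identifier for the analysis, and the state of relevant feature flags. Every name here (`StatusKeyInputs`, `statusCacheKey`, the field names, and the key layout) is hypothetical and is not the key format used in the PR.

```typescript
import { createHash } from "crypto";

// Hypothetical inputs; none of these names come from the PR.
interface StatusKeyInputs {
  schemaVersion: number; // bump when the status format or logic changes
  codeQlVersion: string;
  languages: string[];
  diskSpaceBucket: string; // coarse description of the runner
  analysisId: string; // e.g. workflow and job, to separate Advanced Setup analyses
  featureFlags: Record<string, boolean>; // flags that affect overlay behaviour
}

function shortHash(value: string): string {
  return createHash("sha256").update(value).digest("hex").slice(0, 12);
}

function statusCacheKey(inputs: StatusKeyInputs): string {
  // Sort flag names so the key does not depend on object insertion order.
  const flags = Object.keys(inputs.featureFlags)
    .sort()
    .map((name) => `${name}=${inputs.featureFlags[name]}`)
    .join(",");
  return [
    "overlay-status",
    `v${inputs.schemaVersion}`,
    inputs.codeQlVersion,
    [...inputs.languages].sort().join("_"),
    inputs.diskSpaceBucket,
    shortHash(inputs.analysisId),
    shortHash(flags),
  ].join("-");
}

const base: StatusKeyInputs = {
  schemaVersion: 1,
  codeQlVersion: "0.0.0-example",
  languages: ["python", "java"],
  diskSpaceBucket: "10GB",
  analysisId: "example-workflow/example-job",
  featureFlags: { hypothetical_flag_a: true, hypothetical_flag_b: false },
};

const keyA = statusCacheKey(base);
const keyReordered = statusCacheKey({ ...base, languages: ["java", "python"] });
const keyFlagRolledBack = statusCacheKey({
  ...base,
  featureFlags: { ...base.featureFlags, hypothetical_flag_a: false },
});
```

With a key like this, rolling back a flag yields a different key, so a failure status recorded while the flag was enabled would no longer match. The trade-off is that every extra component fragments the cache further.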
    //
    // Limitation: this can still flip from "too small" to "large enough" and back again if the disk
    // space fluctuates above and below a multiple of 10 GB.
    const diskSpaceToNearest10Gb = `${10 * Math.floor(diskUsage.numTotalBytes / (10 * 1024 * 1024 * 1024))}GB`;
Design question: this fundamentally assumes that the CodeQL analysis typically runs on comparable runners. I.e. the assumption is that unless the amount of total disk space is increased deliberately, the runner specs are the same. Practically speaking, I'd expect that to be the case as well. However, I am not sure whether it is necessarily the case or this is an assumption we have made previously.
My concern is that, if a customer has a runner group for CodeQL containing runners with different specs, we might flip-flop on this -- I think you express that in the "Limitation" part of the comment. That wouldn't be a great experience for a customer if it happened. Is there a way we can mitigate this?
Good point, this might be something that's worth checking in telemetry before rolling this out.
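The flip-flop concern can be illustrated with a small sketch. The rounding expression is the one from the snippet under review, wrapped in a standalone function; the runner sizes are made-up values, not measurements.

```typescript
const GIB = 1024 * 1024 * 1024;

// Same rounding as the reviewed line: round total disk space down to a multiple of 10 GB.
function diskSpaceToNearest10Gb(numTotalBytes: number): string {
  return `${10 * Math.floor(numTotalBytes / (10 * GIB))}GB`;
}

// Hypothetical runners in the same group whose total disk space differs slightly.
const justBelowBoundary = diskSpaceToNearest10Gb(79.5 * GIB);
const justAboveBoundary = diskSpaceToNearest10Gb(80.5 * GIB);
const wellInsideBucket = diskSpaceToNearest10Gb(89 * GIB);
```

Here `justBelowBoundary` lands in the `70GB` bucket while the other two land in `80GB`. Two runners one gigabyte apart can therefore end up with different cache keys, while runners almost nine gigabytes apart share one.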
    try {
      const foundKey = await waitForResultWithTimeLimit(
        MAX_CACHE_OPERATION_MS,
        actionsCache.restoreCache([statusFile], cacheKey),
The paths to restore are an implicit component of the cache key. In this case, if statusFile is different between store and restore, then the cache won't get restored here. Since the path depends on getTemporaryDirectory(), we are dependent on that returning the same path every time.
I see that our dependency caching implementation also relies on getTemporaryDirectory() returning the same path for Java/C#. We should probably check that this doesn't cause any issues and ideally move to something that's more reliably stable.
Do you have any ideas of how to work around this? It seems desirable to store temporary files under RUNNER_TEMP where possible.
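A simplified model of why the path matters, assuming (as the comment above describes) that the paths passed to save and restore contribute to the identity of a cache entry. The function `cacheIdentity`, the file name, and the directories are all hypothetical; this is not the cache library's actual implementation.

```typescript
import { createHash } from "crypto";

// Simplified model (an assumption): an entry is identified by its key
// together with a hash of the paths it was saved with.
function cacheIdentity(key: string, paths: string[]): string {
  const version = createHash("sha256").update(paths.join("|")).digest("hex");
  return `${key}:${version}`;
}

// Hypothetical status file location under a temporary directory.
function statusFilePath(tempDir: string): string {
  return `${tempDir}/overlay-status.json`;
}

const key = "hypothetical-status-key";
const saved = cacheIdentity(key, [statusFilePath("/home/runner/work/_temp")]);
const restoredSameDir = cacheIdentity(key, [statusFilePath("/home/runner/work/_temp")]);
const restoredOtherDir = cacheIdentity(key, [statusFilePath("/tmp/other-temp")]);
```

Under this model, `restoredOtherDir` differs from `saved` even though the key is identical, so a restore from a different temporary directory would miss. One possible mitigation is to choose a path that does not vary between runners, at the cost of no longer writing under `RUNNER_TEMP`.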
@mbg Thanks for the review, I believe I've addressed all your comments, and we discussed the ones in your overall comment offline.
…cks-v2 Add feature flag for more lenient overlay resource checks
mbg left a comment
Happy to approve now. The changes you've made in response to my previous feedback are good and we have discussed the other points in depth elsewhere.
Just one minor comment, don't feel obliged to address it as part of this PR.
    "The attempt to save this failure status to the Actions cache failed. The Action will attempt to " +
    "save this failure status again on the next run, so future runs will skip improved incremental analysis. " +
Minor: Simplify the second sentence to "The Action will attempt to run with improved incremental analysis again." since it may not necessarily fail again.
When overlay analysis (improved incremental analysis) fails on a runner — typically due to insufficient disk space — this PR records that failure in the Actions cache so that subsequent runs will skip overlay analysis automatically until something changes (e.g. a larger runner is provisioned or a new CodeQL version is released).
See the backlinked internal issue for more information.
I recommend reviewing the first commit separately from the rest as this moves the overlay utilities into their own directory.
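As a rough sketch of the mechanism described above, the flow can be modelled with an in-memory map standing in for the Actions cache. All names (`fakeCache`, `statusKey`, `recordBuildResult`, `shouldSkipOverlay`), the key layout, and the version strings are hypothetical and greatly simplified compared to the real change.

```typescript
// In-memory stand-in for the Actions cache; everything here is hypothetical.
interface OverlayStatus {
  builtOverlayBaseDatabase: boolean;
}

const fakeCache = new Map<string, OverlayStatus>();

// The key captures the conditions under which a failure is assumed to repeat.
function statusKey(codeQlVersion: string, diskBucket: string): string {
  return `status-${codeQlVersion}-${diskBucket}`;
}

// Post step: record a failure so that later runs can see it.
function recordBuildResult(key: string, succeeded: boolean): void {
  if (!succeeded) {
    fakeCache.set(key, { builtOverlayBaseDatabase: false });
  }
}

// Init step: skip overlay analysis if a failure was recorded for this key.
function shouldSkipOverlay(key: string): boolean {
  const status = fakeCache.get(key);
  return status !== undefined && !status.builtOverlayBaseDatabase;
}

const failingRun = statusKey("2.0.0-example", "10GB");
recordBuildResult(failingRun, false);

const skipSameConditions = shouldSkipOverlay(failingRun);
const skipAfterLargerRunner = shouldSkipOverlay(statusKey("2.0.0-example", "80GB"));
const skipAfterNewCodeQl = shouldSkipOverlay(statusKey("2.1.0-example", "10GB"));
```

In this model only `skipSameConditions` is true: a larger runner or a new CodeQL version produces a different key, so overlay analysis is attempted again.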
Risk assessment
For internal use only. Please select the risk level of this change:
Which use cases does this change impact?
Workflow types:
dynamic workflows (Default Setup, CCR, ...).
Products:
analysis-kinds: code-scanning.
Environments:
github.com and/or GitHub Enterprise Cloud with Data Residency.
How did/will you validate this change?
.test.ts files).
If something goes wrong after this change is released, what are the mitigation and rollback strategies?
How will you know if something goes wrong after this change is released?
Are there any special considerations for merging or releasing this change?
Merge / deployment checklist