Cache first failure building an overlay base DB to avoid repeated failures by henrymercer · Pull Request #3487 · github/codeql-action

henrymercer · 2026-02-17T15:57:37Z

When overlay analysis (improved incremental analysis) fails on a runner — typically due to insufficient disk space — this PR records that failure in the Actions cache so that subsequent runs will skip overlay analysis automatically until something changes (e.g. a larger runner is provisioned or a new CodeQL version is released).
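As a rough illustration of the mechanism described above, the following sketch records a failure under a key and skips overlay analysis while that key is present. It is not the code from this PR: an in-memory `Map` stands in for the Actions cache, and `OverlayRunContext`, `overlayStatusKey`, `recordOverlayFailure`, and `shouldSkipOverlay` are hypothetical names. The key layout (languages, disk size bucket, CodeQL version) is an assumption based on the description and the review snippets.

```typescript
// Hypothetical sketch: an in-memory Map stands in for the Actions cache.
interface OverlayRunContext {
  languages: readonly string[];
  diskBucket: string; // e.g. "10GB"; assumed to be part of the key
  codeqlVersion: string;
}

type FailureCache = Map<string, { failed: true }>;

// Build a key that changes when the runner size or CodeQL version changes.
function overlayStatusKey(ctx: OverlayRunContext): string {
  const languages = [...ctx.languages].sort().join("+");
  return `overlay-status-${languages}-${ctx.diskBucket}-${ctx.codeqlVersion}`;
}

// Called after an unsuccessful attempt to build the overlay base database.
function recordOverlayFailure(cache: FailureCache, ctx: OverlayRunContext): void {
  cache.set(overlayStatusKey(ctx), { failed: true });
}

// Called during init: skip overlay analysis if a failure was recorded for this key.
function shouldSkipOverlay(cache: FailureCache, ctx: OverlayRunContext): boolean {
  return cache.has(overlayStatusKey(ctx));
}

const failureCache: FailureCache = new Map();
const smallRunner: OverlayRunContext = {
  languages: ["python", "javascript"],
  diskBucket: "10GB",
  codeqlVersion: "2.0.0",
};

recordOverlayFailure(failureCache, smallRunner);
console.log(shouldSkipOverlay(failureCache, smallRunner)); // true
console.log(shouldSkipOverlay(failureCache, { ...smallRunner, diskBucket: "50GB" })); // false
```

Because the runner size and CodeQL version are part of the key in this sketch, provisioning a larger runner or upgrading CodeQL produces a new key, so the recorded failure no longer applies.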

See the backlinked internal issue for more information.

I recommend reviewing the first commit separately from the rest, because it moves the overlay utilities into their own directory.

Risk assessment

For internal use only. Please select the risk level of this change:

Low risk: Changes are fully under feature flags, or have been fully tested and validated in pre-production environments and are highly observable, or are documentation or test only.

Which use cases does this change impact?

Workflow types:

Advanced setup - Impacts users who have custom CodeQL workflows.
Managed - Impacts users with dynamic workflows (Default Setup, CCR, ...).

Products:

Code Scanning - The changes impact analyses when analysis-kinds: code-scanning.

Environments:

Dotcom - Impacts CodeQL workflows on github.com and/or GitHub Enterprise Cloud with Data Residency.

How did/will you validate this change?

Test repository - This change will be tested on a test repository before merging.
Unit tests - I am depending on unit test coverage (i.e. tests in .test.ts files).

If something goes wrong after this change is released, what are the mitigation and rollback strategies?

Feature flags - All new or changed code paths can be fully disabled with corresponding feature flags.

How will you know if something goes wrong after this change is released?

Telemetry - I rely on existing telemetry or have made changes to the telemetry.
- Dashboards - I will watch relevant dashboards for issues after the release. Consider whether this requires this change to be released at a particular time rather than as part of a regular release.
- Alerts - New or existing monitors will trip if something goes wrong with this change.

Are there any special considerations for merging or releasing this change?

No special considerations - This change can be merged at any time.

Merge / deployment checklist

Confirm this change is backwards compatible with existing workflows.
Consider adding a changelog entry for this change.
Confirm the readme and docs have been updated if necessary.


Copilot

Pull request overview

This pull request implements a caching mechanism to avoid repeated failures when overlay analysis (improved incremental analysis) fails on a runner due to insufficient resources, typically disk space. The PR introduces a status tracking system that records failures in the Actions cache, allowing subsequent runs to skip overlay analysis automatically until conditions change (e.g., runner upgrade or new CodeQL version).

Changes:

Added a new src/overlay/status.ts module to track and persist overlay analysis failures via Actions cache
Modified src/init-action-post-helper.ts to record failures when overlay base database builds are unsuccessful
Updated src/config-utils.ts to check cached status and skip overlay analysis when previous failures are detected
Added two new feature flags: OverlayAnalysisStatusCheck and OverlayAnalysisStatusSave
Modified bundleDb function signature to accept an includeDiagnostics parameter
Reorganized overlay-related imports from overlay-database-utils to overlay module structure

Reviewed changes

Copilot reviewed 29 out of 31 changed files in this pull request and generated no comments.


| File | Description |
| --- | --- |
| src/overlay/status.ts | New module implementing status persistence using Actions cache with timeout handling |
| src/overlay/status.test.ts | Comprehensive unit tests for the new status tracking functionality |
| src/init-action-post-helper.ts | Integration to save failure status after unsuccessful overlay-base builds |
| src/init-action-post-helper.test.ts | Tests verifying status saving under different conditions (failure, success, disabled) |
| src/config-utils.ts | Integration to check cached status and skip overlay analysis when indicated |
| src/config-utils.test.ts | Tests for skipping overlay analysis based on cached status |
| src/feature-flags.ts | Added two new feature flags for status check and save operations |
| src/util.ts | Modified `bundleDb` signature to accept `includeDiagnostics` parameter |
| src/database-upload.ts | Updated `bundleDb` call to pass `includeDiagnostics: false` |
| src/debug-artifacts.ts | Updated `bundleDb` call to pass `includeDiagnostics: true` |
| src/doc-url.ts | Added new documentation URL for deleting Actions cache entries |
| src/testing-utils.ts | Updated import path from `overlay-database-utils` to `overlay` |
| src/status-report.ts | Updated import path from `overlay-database-utils` to `overlay` |
| src/overlay/index.ts | Updated import paths to use relative paths from parent directory |
| src/overlay/index.test.ts | Updated imports to match new module structure |
| src/init-action.ts | Updated import path from `overlay-database-utils` to `overlay` |
| src/analyze.ts | Updated import path from `overlay-database-utils` to `overlay` |
| src/analyze-action.ts | Updated import path from `overlay-database-utils` to `overlay` |
| src/codeql.ts | Updated `databaseBundle` signature to include `includeDiagnostics` parameter |
| lib/* | Generated JavaScript files reflecting all TypeScript changes |

mbg

Thanks for putting this together! There's a lot going on here, including some things I like a lot. I have left a bunch of detailed comments, and there are also some high-level points:

- From previous experience, we know that we need to be careful about how we construct cache keys. We should consider more thoroughly what we need to include in the keys here so that we don't shoot ourselves in the foot. In particular, I'd like us to better identify the runner and the analysis (thinking about Advanced Setup), and to guard against changes we need to make to the implementation.
- Caches can interact poorly with feature flags if the feature flags affect what might be in a cache. We currently have a number of feature flags (FFs) that affect overlay analysis behaviour and analysis in general. We might want to include these in the cache key so that we avoid, for example, rolling out a feature flag that breaks all overlay base database builds, rolling it back, and then being stuck for two weeks with caches that indicate failure.
- The decision whether a base database build failed is currently local to a single workflow run. Consider a scenario where we successfully built an overlay base database with CodeQL version X and uploaded it. Now we are running again for a new commit, but building the overlay base database fails with the same CodeQL version, perhaps due to an intermittent failure. We upload the status file, which blocks all future overlay base database builds for this CodeQL version and stops all PR runs from even trying to download the existing base database, which is still in the cache. Perhaps it would be worth checking whether there is an existing base database for the same CodeQL version in the cache?
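To make the first two points concrete, here is a minimal sketch of a status cache key that includes a schema version, a runner identifier, an analysis identifier, and the enabled feature flags. Every name in it (`StatusKeyInputs`, `buildStatusKey`, the flag names, the field layout) is hypothetical. The thread says these points were discussed offline, so this should not be read as what the PR ended up doing.

```typescript
// Hypothetical inputs to a status cache key; none of these names come from the PR.
interface StatusKeyInputs {
  schemaVersion: number; // bump when the status file format or logic changes
  runnerId: string; // some stable description of the runner
  analysisId: string; // e.g. workflow and job, to separate Advanced Setup analyses
  enabledFeatureFlags: readonly string[]; // flags that affect overlay behaviour
}

function buildStatusKey(inputs: StatusKeyInputs): string {
  // Sort a copy so that flag order does not change the key.
  const flags = [...inputs.enabledFeatureFlags].sort().join(",") || "none";
  return [
    "overlay-status",
    `v${inputs.schemaVersion}`,
    inputs.runnerId,
    inputs.analysisId,
    `flags=${flags}`,
  ].join("|");
}

const baseInputs: StatusKeyInputs = {
  schemaVersion: 1,
  runnerId: "linux-x64-10GB",
  analysisId: "codeql.yml:analyze",
  enabledFeatureFlags: ["flag_b", "flag_a"],
};

console.log(buildStatusKey(baseInputs));
// overlay-status|v1|linux-x64-10GB|codeql.yml:analyze|flags=flag_a,flag_b
```

With a key shaped like this, rolling back a feature flag changes the key, so a failure recorded while the flag was enabled would no longer match. The trade-off is that every flag change also discards status entries that might still be valid.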

src/overlay/index.ts

src/overlay/status.ts

mbg · 2026-02-17T21:01:22Z

src/overlay/status.ts

+  //
+  // Limitation: this can still flip from "too small" to "large enough" and back again if the disk
+  // space fluctuates above and below a multiple of 10 GB.
+  const diskSpaceToNearest10Gb = `${10 * Math.floor(diskUsage.numTotalBytes / (10 * 1024 * 1024 * 1024))}GB`;


Design question: this fundamentally assumes that the CodeQL analysis typically runs on comparable runners. I.e. the assumption is that unless the amount of total disk space is increased deliberately, the runner specs are the same. Practically speaking, I'd expect that to be the case as well. However, I am not sure whether it is necessarily the case, or whether this is an assumption we have made previously.

My concern is that, if a customer has a runner group for CodeQL containing runners with different specs, we might flip-flop on this -- I think you express that in the "Limitation" part of the comment. That wouldn't be a great experience for a customer if it happened. Is there a way we can mitigate this?

Good point, this might be something that's worth checking in telemetry before rolling this out.
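The bucketing expression from the diff can be tried in isolation to see the limitation discussed in this thread. In the sketch below, the wrapper function `diskBucket` and the example sizes are assumptions for illustration; only the arithmetic is taken from the snippet above. Note that the expression rounds down to a multiple of 10 GiB, so two runners whose total disk space sits just either side of a boundary land in different buckets.

```typescript
const GIB = 1024 * 1024 * 1024;

// Same arithmetic as the snippet under review, wrapped in a hypothetical helper.
function diskBucket(numTotalBytes: number): string {
  return `${10 * Math.floor(numTotalBytes / (10 * GIB))}GB`;
}

console.log(diskBucket(14 * GIB)); // 10GB
console.log(diskBucket(19.9 * GIB)); // 10GB
console.log(diskBucket(20.1 * GIB)); // 20GB
```

If a runner group mixes machines like the last two, consecutive runs would compute different buckets, which is the flip-flopping scenario raised in the review.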

src/overlay/status.test.ts

mbg · 2026-02-17T22:38:25Z

src/overlay/status.ts

+  try {
+    const foundKey = await waitForResultWithTimeLimit(
+      MAX_CACHE_OPERATION_MS,
+      actionsCache.restoreCache([statusFile], cacheKey),


The paths to restore are an implicit component of the cache key. In this case, if statusFile is different between store and restore, then the cache won't get restored here. Since the path depends on getTemporaryDirectory(), we are dependent on that returning the same path every time.

I see that our dependency caching implementation also relies on getTemporaryDirectory() returning the same path for Java/C#. We should probably check that this doesn't cause any issues and ideally move to something that's more reliably stable.

Do you have any ideas of how to work around this? It seems desirable to store temporary files under RUNNER_TEMP where possible.
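The concern in this thread can be modelled with a toy cache in which an entry is identified by both its key and the paths it was saved with. This is a deliberately simplified model of the behaviour described in the comment, not the toolkit's implementation: `ModelStore`, `entryId`, `saveModel`, `restoreModel`, and the directory names are all hypothetical.

```typescript
// Simplified model: an entry's identity is its key plus the paths it was saved with.
type ModelStore = Map<string, string>;

function entryId(paths: readonly string[], key: string): string {
  return `${key}#${paths.join("|")}`;
}

function saveModel(store: ModelStore, paths: readonly string[], key: string, contents: string): void {
  store.set(entryId(paths, key), contents);
}

function restoreModel(store: ModelStore, paths: readonly string[], key: string): string | undefined {
  return store.get(entryId(paths, key));
}

const store: ModelStore = new Map();
const key = "overlay-status-example";

// Saved while the temporary directory resolved to one location...
saveModel(store, ["/runner/_temp/overlay-status.json"], key, "failed");

// ...restored with the same path: hit.
console.log(restoreModel(store, ["/runner/_temp/overlay-status.json"], key)); // failed

// ...restored after the temporary directory changed: miss, despite the same key.
console.log(restoreModel(store, ["/other/_temp/overlay-status.json"], key)); // undefined
```

In this model the miss is silent, which is why a path that is stable across runs and runners matters as much as the key itself.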

src/config-utils.ts

src/init-action-post-helper.ts

henrymercer · 2026-02-19T18:23:47Z

@mbg Thanks for the review, I believe I've addressed all your comments, and we discussed the ones in your overall comment offline.


mbg

Happy to approve now. The changes you've made in response to my previous feedback are good and we have discussed the other points in depth elsewhere.

Just one minor comment, don't feel obliged to address it as part of this PR.

mbg · 2026-02-23T17:26:18Z

src/init-action-post-helper.ts

+        "The attempt to save this failure status to the Actions cache failed. The Action will attempt to " +
+        "save this failure status again on the next run, so future runs will skip improved incremental analysis. " +


Minor: Simplify the second sentence to "The Action will attempt to run with improved incremental analysis again." since it may not necessarily fail again.

henrymercer added 20 commits February 17, 2026 15:54

Create separate directory for overlay source code

d1bdc0e

Compute cache key for overlay language status

d28d996

Add save and restore methods

69c2819

Generalise status to multiple languages

e275d63

Skip overlay analysis based on cached status

ebad062

Save overlay status to Actions cache

96961e0

Introduce feature flags for saving and checking status

827bba6

Be more explicit about attempt to build overlay DB

6c405c2

Sort doc URLs

0c47ae1

Add status page diagnostic when overlay skipped

7b7a951

Only store overlay status if analysis failed

ef58c00

Improve diagnostic message wording

cc0dce0

Tweak diagnostic message

d24014a

Improve error messages

3dd1275

More error message improvements

554b931

Include diagnostics in bundle

5c583bb

Avoid mutating languages array in overlay status functions

05d4e25

Use [...languages].sort() instead of languages.sort() to avoid mutating the caller's array as a side effect.
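The difference this commit message describes is easy to demonstrate: `Array.prototype.sort` sorts in place and returns the same array, whereas sorting a spread copy leaves the caller's array untouched. The function names below are hypothetical and exist only for the comparison.

```typescript
// Sorts the caller's array in place as a side effect.
function joinSortedMutating(languages: string[]): string {
  return languages.sort().join("+");
}

// Sorts a shallow copy, so the caller's array keeps its original order.
function joinSortedCopy(languages: readonly string[]): string {
  return [...languages].sort().join("+");
}

const languages = ["python", "cpp", "javascript"];

console.log(joinSortedCopy(languages)); // cpp+javascript+python
console.log(languages); // [ 'python', 'cpp', 'javascript' ] (unchanged)

console.log(joinSortedMutating(languages)); // cpp+javascript+python
console.log(languages); // [ 'cpp', 'javascript', 'python' ] (reordered)
```

Typing the parameter as `readonly string[]` also lets the compiler reject an accidental in-place `sort()` inside the function.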

Add tests for shouldSkipOverlayAnalysis

657f337

Extract status file path helper

fa56ea8

Improve log message

898ae16

github-actions bot added the size/XL (May be very hard to review) label Feb 17, 2026

henrymercer marked this pull request as ready for review February 17, 2026 18:11

henrymercer requested a review from a team as a code owner February 17, 2026 18:11

Copilot AI review requested due to automatic review settings February 17, 2026 18:11

Copilot started reviewing on behalf of henrymercer February 17, 2026 18:12

Copilot AI reviewed Feb 17, 2026


mbg reviewed Feb 18, 2026


Address review comments

4191f52

henrymercer requested a review from mbg February 20, 2026 17:12

henrymercer mentioned this pull request Feb 20, 2026

Add feature flag for more lenient overlay resource checks #3498

Merged

henrymercer added 2 commits February 20, 2026 18:26

Add feature flag for more lenient overlay resource checks

4e71011

Merge pull request #3498 from github/henrymercer/overlay-resource-che…

1847416

…cks-v2 Add feature flag for more lenient overlay resource checks

mbg approved these changes Feb 23, 2026
