Commit 5725b55

coadaflorin and Copilot committed
Add incremental analysis documentation for the CodeQL CLI
Add a new article covering diff-informed analysis and overlay analysis, two complementary features that speed up CodeQL analysis for pull requests when using the CodeQL CLI in CI systems.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 42960f7 commit 5725b55

2 files changed

Lines changed: 361 additions & 0 deletions

Lines changed: 360 additions & 0 deletions
@@ -0,0 +1,360 @@
1+
---
title: Using incremental analysis with the CodeQL CLI
shortTitle: Incremental analysis
intro: 'Speed up {% data variables.product.prodname_codeql %} analysis for pull requests by using diff-informed analysis and overlay analysis with the {% data variables.product.prodname_codeql_cli %}.'
allowTitleToDifferFromFilename: true
product: '{% data reusables.gated-features.codeql %}'
versions:
  fpt: '*'
  ghes: '*'
  ghec: '*'
contentType: how-tos
category:
  - Customize vulnerability detection with CodeQL
---
15+
16+
## About incremental analysis
17+
18+
When you use the {% data variables.product.prodname_codeql_cli %} in a CI system, you can enable two complementary features to speed up {% data variables.product.prodname_codeql %} analysis on pull requests:
19+
20+
* **Diff-informed analysis** restricts query results to alerts whose locations fall within the lines added or modified in the pull request diff. This makes analysis faster and more focused.
21+
* **Overlay analysis** speeds up database creation and query evaluation by building on top of a pre-existing "base" database from the default branch, instead of creating a full database from scratch for every pull request.
22+
23+
These features are independent and can be used separately or together. When both are active, overlay analysis handles efficient database creation and query evaluation, while diff-informed analysis handles efficient result filtering.
24+
25+
## Diff-informed analysis
26+
27+
Diff-informed analysis is an optimization for pull request analysis. Instead of reporting all alerts found in the codebase, it restricts the results to only those alerts whose locations fall within lines that were added or modified in the pull request diff.
28+
29+
**Minimum {% data variables.product.prodname_codeql_cli %} bundle version:** 2.21.0
30+
31+
### How diff-informed analysis works
32+
33+
1. Compute the diff between the pull request base branch and head branch.
34+
1. Determine the added or modified line ranges from the diff.
35+
1. Package those line ranges as a {% data variables.product.prodname_codeql %} data extension that feeds into the `restrictAlertsTo` extensible predicate in the `codeql/util` standard library.
36+
1. Pass the extension pack to `codeql database run-queries` so that queries can restrict their computation to alerts in changed lines.
37+
1. Filter the SARIF output on the CI side to remove any remaining alerts outside the diff ranges.
38+
39+
> [!NOTE]
40+
> CI-side SARIF filtering (the last step in the list above, described in "[Step 6: Filter the SARIF output](#step-6-filter-the-sarif-output)") is required because the `restrictAlertsTo` predicate permits, but does not require, queries to omit out-of-range alerts. Filtering ensures that the final set of reported alerts is stable and limited to the diff range, regardless of query-side behavior.
41+
42+
### Step 1: Determine the diff
43+
44+
You need the unified diff between the base commit and the head commit of the pull request. You can use `git diff`, your source control management system's API, or any other mechanism to obtain the diff.
45+
46+
```shell
git diff BASE_SHA...HEAD_SHA
```
49+
50+
> [!NOTE]
51+
> If your diff source truncates or is incomplete for large pull requests (for example, an API that limits the number of changed files), you should disable diff-informed analysis and fall back to full analysis for that run.
52+
53+
### Step 2: Parse the diff into line ranges
54+
55+
From the unified diff, extract the added or modified line ranges in the head version of each file. For each file, you need an array of ranges with the following structure:
56+
57+
* `path`: Absolute file path (always use forward slashes)
58+
* `startLine`: 1-based, inclusive start line
59+
* `endLine`: 1-based, inclusive end line
60+
61+
To parse the diff:
62+
63+
1. Split each file's patch into lines.
64+
1. For lines starting with `@@`, parse the hunk header `@@ -X,Y +Z,W @@` to find `Z` (the starting line number in the new file). Set `currentLine = Z`.
65+
1. For lines starting with `+` (additions), record the start of a new range if one isn't in progress. Increment `currentLine`.
66+
1. For lines starting with `-` (deletions), skip the line. Deletions don't affect new-file line numbers.
67+
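1. For context lines (starting with a space), close any in-progress range and increment `currentLine`. Also close any in-progress range at each hunk header and at the end of the patch. A closed range ends at `currentLine - 1`.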
68+
69+
**Special cases:**
70+
71+
* **Binary files or very large diffs** (no patch content available): Use the sentinel range `{path, startLine: 0, endLine: 0}` to indicate "entire file."
72+
* **Renamed files with no content changes**: Return an empty array (no ranges).
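
The parsing steps and special cases above can be sketched in Python as follows. This is a minimal illustration, not the implementation used by any particular CI integration. The function name `parse_patch` is hypothetical, and the sketch assumes that `patch` contains only hunks (hunk headers plus `+`, `-`, and context lines), without the `diff --git`, `---`, or `+++` file header lines. It also assumes the caller passes `None` when no patch content is available and an empty string for a rename without content changes.

```python
import re
from typing import Optional

# Matches "@@ -X,Y +Z,W @@" and captures Z; the ",Y" and ",W" counts are optional.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


def parse_patch(path: str, patch: Optional[str]) -> list:
    """Return the added or modified line ranges in the new version of one file."""
    if patch is None:
        # No patch content (binary file or very large diff): whole-file sentinel.
        return [{"path": path, "startLine": 0, "endLine": 0}]

    ranges = []
    current_line = 0
    range_start = None
    # The extra trailing context line ensures the last range is always closed.
    for line in patch.split("\n") + [" "]:
        if line.startswith("-"):
            # Deletions don't affect new-file line numbers or break a range.
            continue
        if line.startswith("+"):
            if range_start is None:
                range_start = current_line
            current_line += 1
            continue
        # Any other line (hunk header, context, marker) closes an open range.
        if range_start is not None:
            ranges.append(
                {"path": path, "startLine": range_start, "endLine": current_line - 1}
            )
            range_start = None
        match = HUNK_HEADER.match(line)
        if match:
            current_line = int(match.group(1))
        elif line.startswith(" "):
            current_line += 1
    return ranges
```

Consecutive additions that are separated only by deletions end up in a single range, because deletions are skipped without closing the range in progress.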
73+
74+
### Step 3: Create a data extension pack
75+
76+
Create a temporary directory containing two files. This extension pack feeds into the `restrictAlertsTo` extensible predicate defined in the {% data variables.product.prodname_codeql %} standard library.
77+
78+
**`qlpack.yml`:**
79+
80+
```yaml
name: my-ci/pr-diff-range
version: 0.0.0
library: true
extensionTargets:
  codeql/util: '*'
dataExtensions:
  - pr-diff-range.yml
```
89+
90+
**`pr-diff-range.yml`:**
91+
92+
```yaml
extensions:
  - addsTo:
      pack: codeql/util
      extensible: restrictAlertsTo
      checkPresence: false
    data:
      - ["/absolute/path/to/file1.ts", 42, 45]
      - ["/absolute/path/to/file2.ts", 10, 12]
```
102+
103+
Each data row is `[filePath, lineStart, lineEnd]`. Line numbers are 1-based. The special case `lineStart = 0, lineEnd = 0` denotes a whole-file match.
104+
105+
> [!IMPORTANT]
106+
> If the diff has zero added or modified lines (for example, only deletions), you must still provide a non-empty data extension with a sentinel entry `["", 0, 0]`. An empty `data` section would leave the `restrictAlertsTo` predicate inactive, which means all alerts would be produced—the opposite of the desired behavior.
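
As an illustration of generating both files, including the sentinel row for a diff with no added or modified lines, here is a minimal Python sketch. The function name `write_extension_pack` and the shape of the `ranges` argument (dictionaries with `path`, `startLine`, and `endLine` keys) are assumptions made for this example. The data rows are written with `json.dumps`, relying on the fact that a JSON array of strings and integers is also a valid YAML flow sequence.

```python
import json
from pathlib import Path

QLPACK_YML = """\
name: my-ci/pr-diff-range
version: 0.0.0
library: true
extensionTargets:
  codeql/util: '*'
dataExtensions:
  - pr-diff-range.yml
"""


def write_extension_pack(pack_dir, ranges) -> str:
    """Write qlpack.yml and pr-diff-range.yml; return the data extension text."""
    pack_dir = Path(pack_dir)
    pack_dir.mkdir(parents=True, exist_ok=True)
    (pack_dir / "qlpack.yml").write_text(QLPACK_YML, encoding="utf-8")

    rows = [[r["path"], r["startLine"], r["endLine"]] for r in ranges]
    if not rows:
        # Sentinel row: keeps restrictAlertsTo active when nothing was added.
        rows = [["", 0, 0]]

    lines = [
        "extensions:",
        "  - addsTo:",
        "      pack: codeql/util",
        "      extensible: restrictAlertsTo",
        "      checkPresence: false",
        "    data:",
    ]
    lines += [f"      - {json.dumps(row)}" for row in rows]
    text = "\n".join(lines) + "\n"
    (pack_dir / "pr-diff-range.yml").write_text(text, encoding="utf-8")
    return text
```

The directory passed as `pack_dir` is the one you would then supply to `--additional-packs`.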
107+
108+
### Step 4: Pass the extension pack to the {% data variables.product.prodname_codeql_cli %}
109+
110+
When running queries, add the following flags to `codeql database run-queries`:
111+
112+
```shell
codeql database run-queries \
  --additional-packs=PATH_TO_EXTENSION_PACK \
  --extension-packs=my-ci/pr-diff-range \
  PATH_TO_DATABASE \
  QUERIES
```
119+
120+
* `--additional-packs` tells {% data variables.product.prodname_codeql %} where to find the pack on disk.
121+
* `--extension-packs` tells {% data variables.product.prodname_codeql %} to load the named extension pack.
122+
123+
### Step 5: Exclude diagnostic queries
124+
125+
When using diff-informed analysis, you should exclude queries tagged with `exclude-from-incremental`. These are diagnostic queries (for example, metrics or code coverage queries) that do not produce alerts, so they provide no value in an incremental context but still consume resources.
126+
127+
You can add the following query filter to your code scanning configuration file:
128+
129+
```yaml
query-filters:
  - exclude:
      tags: exclude-from-incremental
```
134+
135+
Alternatively, create a query suite file (`.qls`) that excludes those queries:
136+
137+
```yaml
- description: Pull request queries for Java
- import: codeql-suites/java-code-scanning.qls
- exclude:
    tags contain: exclude-from-incremental
```
143+
144+
For more information about code scanning configuration files, see [AUTOTITLE](/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning#specifying-codeql-query-packs).
145+
146+
### Step 6: Filter the SARIF output
147+
148+
After {% data variables.product.prodname_codeql %} generates the SARIF file, you must filter the output on the CI side to remove results whose locations fall outside the diff ranges.
149+
150+
For each result in the SARIF, check whether any of its `locations` or `relatedLocations` have a `startLine` that falls within a diff range for that file. If none match, remove the result. The filtering logic checks: `range.startLine <= alert.startLine <= range.endLine`, with the special case that `range.startLine == range.endLine == 0` matches any alert in the file.
151+
152+
> [!NOTE]
153+
> Make sure SARIF artifact locations are normalized to the same path format used in the diff ranges before comparing.
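
The filtering rule can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation. The function names are hypothetical, `ranges` is assumed to be a list of dictionaries with `path`, `startLine`, and `endLine` keys, and the sketch assumes that each SARIF `artifactLocation.uri` has already been normalized to the same path format as the diff ranges. Locations without a URI or a `region.startLine` are treated as non-matching.

```python
def in_diff_range(uri, start_line, ranges) -> bool:
    """Check one location against the diff ranges for its file."""
    for r in ranges:
        if r["path"] != uri:
            continue
        if r["startLine"] == 0 and r["endLine"] == 0:
            return True  # whole-file sentinel matches any alert in the file
        if r["startLine"] <= start_line <= r["endLine"]:
            return True
    return False


def result_locations(result):
    """Yield (uri, startLine) for each usable location or related location."""
    for loc in result.get("locations", []) + result.get("relatedLocations", []):
        physical = loc.get("physicalLocation", {})
        uri = physical.get("artifactLocation", {}).get("uri")
        start_line = physical.get("region", {}).get("startLine")
        if uri is not None and start_line is not None:
            yield uri, start_line


def filter_sarif(sarif: dict, ranges) -> dict:
    """Drop results with no location inside a diff range (modifies sarif in place)."""
    for run in sarif.get("runs", []):
        run["results"] = [
            result
            for result in run.get("results", [])
            if any(in_diff_range(uri, line, ranges) for uri, line in result_locations(result))
        ]
    return sarif
```

A result is kept as soon as any one of its locations or related locations falls inside a range, which matches the rule described above.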
154+
155+
### Step 7: Tag the SARIF output (optional)
156+
157+
You can add the following flag to your existing `codeql database interpret-results` command to tag the SARIF output with metadata indicating that diff-informed analysis was used:
158+
159+
```shell
codeql database interpret-results \
  --sarif-run-property=incrementalMode=diff-informed \
  PATH_TO_DATABASE
```
164+
165+
### Summary of CLI flags for diff-informed analysis
166+
167+
| CLI command | Flag | Purpose |
|---|---|---|
| `codeql database init` | `--codescanning-config=FILE` | Code scanning configuration file (for query filter) |
| `codeql database run-queries` | `--additional-packs=DIR` | Location of the extension pack |
| `codeql database run-queries` | `--extension-packs=my-ci/pr-diff-range` | Name of the extension pack to load |
| `codeql database interpret-results` | `--sarif-run-property=incrementalMode=diff-informed` | Tag SARIF with diff-informed metadata |
173+
174+
## Overlay analysis
175+
176+
Overlay analysis speeds up {% data variables.product.prodname_codeql %} database creation and analysis for pull requests by building on top of a pre-existing "base" database. Instead of creating a full database from scratch for every pull request, it:
177+
178+
1. **On the default branch:** Builds an "overlay-base" database (a full database with cached intermediate results and extra metadata).
179+
1. **On pull requests:** Downloads the cached overlay-base database, then creates a lightweight "overlay" database that only processes the changed files on top of the base.
180+
181+
This dramatically reduces both database creation time and query evaluation time for pull requests.
182+
183+
**Minimum {% data variables.product.prodname_codeql_cli %} bundle version:** 2.23.8 (with per-language minimums—see "[Minimum CLI bundle versions](#minimum-cli-bundle-versions)")
184+
185+
### Requirements
186+
187+
Before using overlay analysis, make sure the following requirements are met:
188+
189+
* The source root must be inside a Git repository.
190+
* Git version 2.38.0 or later (for the `--format` option used by `git ls-files`).
191+
* All files of interest must be tracked by Git (not in `.gitignore`).
192+
* The Git index must accurately reflect the source tree being analyzed.
193+
* Overlay analysis supports only `build-mode: none`. Traced builds are not supported. If a language is configured with traced builds, overlay analysis is not available for that language.
194+
195+
> [!NOTE]
196+
> Go does not specifically support `build-mode: none`, but the Go extractor behaves sufficiently similarly that overlay analysis works with it.
197+
198+
### Overlay-base mode (default branch)
199+
200+
Run overlay-base mode on your default branch after each merge to create and cache a base database.
201+
202+
#### 1. Initialize the database with `--overlay-base`
203+
204+
```shell
codeql database init \
  --overlay-base \
  --db-cluster \
  PATH_TO_DATABASE \
  --source-root=PATH_TO_SOURCE \
  --language=LANGUAGE
```
212+
213+
The `--overlay-base` flag tells {% data variables.product.prodname_codeql %} to build a database that can serve as a base for future overlay analysis.
214+
215+
#### 2. Build and extract as normal
216+
217+
Run any build steps and extraction as you normally would for your project.
218+
219+
#### 3. Record file OIDs
220+
221+
After extraction completes, record the Git object IDs (OIDs) of all tracked files under the source root. This snapshot is used later to determine which files changed.
222+
223+
```shell
git ls-files --recurse-submodules --format='%(objectname)_%(path)'
```
226+
227+
Parse this output into a JSON map of `{ "relative/path": "git-oid" }` and store it alongside the database.
228+
229+
> [!NOTE]
230+
> The output includes files in Git submodules. Overlay analysis requires accurate tracking of all file changes between the base and the overlay, including those within submodules.
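
One way to turn the `git ls-files` output into the JSON map is sketched below in Python. The function names and the output file location are hypothetical. The sketch splits each line at the first underscore, which is safe because object IDs are hexadecimal. It does not decode paths that Git prints in quoted form (for example, paths containing unusual characters), so treat that as a known gap in this illustration.

```python
import json
import subprocess


def parse_ls_files(output: str) -> dict:
    """Parse lines of the form '<oid>_<path>' into {path: oid}."""
    oids = {}
    for line in output.split("\n"):
        if not line:
            continue
        # OIDs are hexadecimal, so the first underscore is the separator.
        oid, sep, path = line.partition("_")
        if not sep or not oid or not path:
            raise ValueError(f"Unexpected git ls-files output line: {line!r}")
        oids[path] = oid
    return oids


def record_file_oids(source_root: str, out_file: str) -> dict:
    """Run git ls-files in source_root and store the OID map as JSON."""
    output = subprocess.run(
        ["git", "ls-files", "--recurse-submodules", "--format=%(objectname)_%(path)"],
        cwd=source_root,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    oids = parse_ls_files(output)
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(oids, f)
    return oids
```

Because the command runs with the source root as its working directory, the recorded paths are relative to the source root.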
231+
232+
#### 4. Run queries and preserve the cache
233+
234+
When running queries on an overlay-base database, do **not** pass `--expect-discarded-cache`. This flag tells {% data variables.product.prodname_codeql %} that cached intermediate results can be discarded, but for overlay-base databases you need to preserve them for reuse.
235+
236+
#### 5. Clean up and cache the database
237+
238+
After analysis, clean up the database using the `overlay` cleanup level:
239+
240+
```shell
codeql database cleanup PATH_TO_DATABASE --cache-cleanup=overlay
```
243+
244+
The `overlay` cleanup level preserves more cached data than the `clear` level, because overlay databases need that cached data for efficient query evaluation.
245+
246+
Then store the database (including the OIDs file) in your caching system for later retrieval by pull request builds.
247+
248+
### Overlay mode (pull requests)
249+
250+
Run overlay mode on pull request builds to create a lightweight database on top of the cached base.
251+
252+
> [!IMPORTANT]
253+
> If no compatible overlay-base database is available in the cache (for example, on the first run or after a {% data variables.product.prodname_codeql_cli %} version upgrade), do not pass `--overlay-changes`. Instead, run a normal full analysis. Cache keys should include at least the {% data variables.product.prodname_codeql_cli %} version and language set to avoid incompatible base databases.
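
A cache key along these lines could be built as shown in the following Python sketch. The function name, the `codeql-overlay-base` prefix, and the key layout are all hypothetical. The only requirement taken from the guidance above is that the key changes when the CLI version or the language set changes.

```python
def overlay_base_cache_key(cli_version: str, languages, prefix: str = "codeql-overlay-base") -> str:
    """Build a cache key that changes with the CLI version and the language set."""
    # Sort and de-duplicate so the key doesn't depend on the order languages are listed.
    language_part = "_".join(sorted(set(languages)))
    return f"{prefix}-{cli_version}-{language_part}"
```

Depending on your caching system, you may also want to include other components, such as the repository or the default branch commit, so that pull request builds pick up the most recent base.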
254+
255+
#### 1. Download the cached overlay-base database
256+
257+
Retrieve the most recent overlay-base database from your cache. The database should include the OIDs file recorded during overlay-base mode.
258+
259+
#### 2. Compute changed files
260+
261+
Compare the OIDs recorded in the base database with the current Git state:
262+
263+
```shell
git ls-files --recurse-submodules --format='%(objectname)_%(path)'
```
266+
267+
Compare the two maps to find files that were added, removed, or modified (different OID). Write the result as a JSON file:
268+
269+
```json
{
  "changes": ["src/modified-file.ts", "src/new-file.ts", "src/deleted-file.ts"]
}
```
274+
275+
The file paths must be relative to the source root.
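
The comparison of the two OID maps can be sketched in Python as follows. The function names are hypothetical, and both maps are assumed to have the `{ "relative/path": "git-oid" }` shape recorded during overlay-base mode.

```python
import json


def compute_changed_files(base_oids: dict, current_oids: dict) -> list:
    """Return paths that were added, removed, or whose OID differs."""
    all_paths = base_oids.keys() | current_oids.keys()
    # dict.get returns None for a missing path, so added and removed files
    # compare as different from the OID on the other side.
    return sorted(p for p in all_paths if base_oids.get(p) != current_oids.get(p))


def write_overlay_changes(base_oids: dict, current_oids: dict, out_file: str) -> list:
    """Write the changes file in the {"changes": [...]} format shown above."""
    changes = compute_changed_files(base_oids, current_oids)
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump({"changes": changes}, f)
    return changes
```

The sorted output is a choice made here for reproducible files, not a requirement stated by the format.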
276+
277+
#### 3. Initialize the database with `--overlay-changes`
278+
279+
Run `codeql database init` against the restored overlay-base database directory. The `PATH_TO_DATABASE` must point to the restored cached overlay-base database, not a new empty directory—the command extends the existing base for the pull request analysis.
280+
281+
```shell
codeql database init \
  --overlay-changes=PATH_TO_OVERLAY_CHANGES_JSON \
  --db-cluster \
  PATH_TO_DATABASE \
  --source-root=PATH_TO_SOURCE \
  --language=LANGUAGE
```
289+
290+
> [!IMPORTANT]
291+
> In overlay mode, do not pass `--overwrite` or `--force-overwrite`. You are building on top of the existing cached base database, not replacing it.
292+
293+
#### 4. Build, extract, and run queries as normal
294+
295+
Proceed with build, extraction, and query execution as normal. You can add the following flag to your existing `codeql database interpret-results` command to tag the SARIF output with overlay metadata:
296+
297+
```shell
codeql database interpret-results \
  --sarif-run-property=incrementalMode=overlay \
  PATH_TO_DATABASE
```
302+
303+
If both overlay and diff-informed analysis are active, use `incrementalMode=overlay,diff-informed`.
304+
305+
### Exclude diagnostic queries
306+
307+
As with diff-informed analysis, you should exclude queries tagged `exclude-from-incremental` when using overlay mode:
308+
309+
```yaml
query-filters:
  - exclude:
      tags: exclude-from-incremental
```
314+
315+
For more information, see "[Step 5: Exclude diagnostic queries](#step-5-exclude-diagnostic-queries)."
316+
317+
### Summary of CLI flags for overlay analysis
318+
319+
| CLI command | Flag | Mode | Purpose |
|---|---|---|---|
| `codeql database init` | `--codescanning-config=FILE` | overlay | Code scanning configuration file (for query filter) |
| `codeql database init` | `--overlay-base` | overlay-base | Build a base database for future overlay use |
| `codeql database init` | `--overlay-changes=FILE` | overlay | Build overlay database using only changed files |
| `codeql database init` | _(no `--overwrite`)_ | overlay | Don't overwrite the cached base database |
| `codeql database run-queries` | _(no `--expect-discarded-cache`)_ | overlay-base | Preserve cached intermediate results |
| `codeql database cleanup` | `--cache-cleanup=overlay` | overlay-base | Use overlay-specific cleanup level |
| `codeql database interpret-results` | `--sarif-run-property=incrementalMode=overlay` | overlay | Tag SARIF with overlay metadata |
328+
329+
### Minimum CLI bundle versions
330+
331+
The base minimum version for overlay analysis is 2.23.8. Some languages require higher minimum versions:
332+
333+
| Language | Minimum {% data variables.product.prodname_codeql_cli %} bundle version |
|---|---|
| C# | 2.24.1 |
| Go | 2.24.2 |
| Java | 2.23.8 |
| JavaScript | 2.23.9 |
| Python | 2.23.9 |
| Ruby | 2.23.9 |
341+
342+
> [!NOTE]
343+
> These minimum versions may increase over time as overlay analysis evolves.
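
A CI script could check these minimums before enabling overlay analysis, as in the following Python sketch. The function names are hypothetical, and the dictionary keys (`csharp`, `javascript`, and so on) are an assumed mapping from the table's language names to the identifiers your pipeline passes to `--language`. The sketch only handles plain `MAJOR.MINOR.PATCH` version strings, and the version numbers are copied from the table above, so they need updating if the minimums change.

```python
# Minimum CLI bundle versions per language, copied from the table above.
MIN_OVERLAY_VERSION = {
    "csharp": (2, 24, 1),
    "go": (2, 24, 2),
    "java": (2, 23, 8),
    "javascript": (2, 23, 9),
    "python": (2, 23, 9),
    "ruby": (2, 23, 9),
}


def parse_version(version: str) -> tuple:
    """Convert 'MAJOR.MINOR.PATCH' into a tuple of integers for comparison."""
    return tuple(int(part) for part in version.split("."))


def overlay_supported(language: str, cli_version: str) -> bool:
    """Return True if the language is listed and the CLI version meets its minimum."""
    minimum = MIN_OVERLAY_VERSION.get(language)
    return minimum is not None and parse_version(cli_version) >= minimum
```

Comparing tuples of integers avoids the string-comparison pitfall where `2.23.10` would sort before `2.23.9`.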
344+
345+
## Using both features together
346+
347+
Diff-informed analysis and overlay analysis are independent features that can be used separately or together. When both are active:
348+
349+
1. Overlay analysis handles efficient database creation (only extracting changed files) and efficient query evaluation (reusing cached intermediate results from the base database).
350+
1. Diff-informed analysis handles efficient result filtering (only reporting alerts in changed lines).
351+
1. Both sets of CLI flags are combined in your CI configuration.
352+
353+
### Decision matrix
354+
355+
| Scenario | Diff-informed | Overlay |
|---|---|---|
| Default branch push | No (not a PR) | overlay-base mode |
| PR analysis (first time, no cache) | Yes | No (run full analysis) |
| PR analysis (with cached base) | Yes | overlay mode |
| Non-PR, non-default branch | No | No |
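
The decision matrix can be expressed as a small helper in a CI script, as in this Python sketch. The function name, its parameters, and the returned dictionary shape are hypothetical. How you detect a pull request, the default branch, or a usable cached base depends on your CI system.

```python
def choose_modes(is_pull_request: bool, is_default_branch: bool, has_cached_base: bool) -> dict:
    """Map a CI scenario to the diff-informed and overlay settings in the matrix above."""
    if is_pull_request:
        # Pull requests always use diff-informed analysis; overlay mode
        # additionally requires a compatible cached overlay-base database.
        return {"diff_informed": True, "overlay": "overlay" if has_cached_base else "none"}
    if is_default_branch:
        return {"diff_informed": False, "overlay": "overlay-base"}
    return {"diff_informed": False, "overlay": "none"}
```

An `overlay` value of `none` corresponds to running a normal full analysis.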

content/code-security/how-tos/find-and-fix-code-vulnerabilities/scan-from-the-command-line/index.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ children:
   - /testing-query-help-files
   - /download-databases
   - /check-out-source-code
+  - /incremental-analysis
   - /specifying-command-options-in-a-codeql-configuration-file
   - /creating-database-bundle-for-troubleshooting
 redirect_from:
