Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 31 additions & 6 deletions mongodb-query-index-check/prompts/review.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,19 +45,29 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag

1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` β†’ `disabledAt: { $ne: null }`) makes the index unusable.
2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort.
3. **Shard-key prefix dropped.** Sharded collections in `$MONGO_INDEXES_DIR` are marked with `// For sharding ...` next to a unique index whose first field is the shard key (often `userId`). Dropping it fans out to every shard.
3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) β€” without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class in `apify/apify-core`'s `mongo-connection` package β€” its constructor takes the shard keys as the second positional argument: `new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys, openTelemetry?)`, used like `new ShardAwareCollection<ActorRun, 'userId' | '_id'>(this.Act2Runs, ['userId', '_id'], openTelemetry)` β€” meaning `Act2Runs` is sharded on `[userId, _id]`. To find shard keys for a collection, grep the workspace for `new ShardAwareCollection<` and read off the array literal at the matching usage. Secondary signal: a `// For sharding <Collection>` comment in the matching `$MONGO_INDEXES_DIR/<file>.ts`. When neither source confirms sharding, prefer not to flag.
4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index.
5. **Equality β†’ negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields.
6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index.
7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index.
8. **Sharded-collection operation that should pass `readConcern: 'available'`.** Sharded chunks can contain orphan documents (residual from chunk migrations). For operations that would normally be served from the index alone, MongoDB has to fetch every candidate document and run a `SHARDING_FILTER` stage to drop orphans β€” expensive at scale. Passing `readConcern: 'available'` bypasses that filter at the cost of possibly including a few orphans (fine for most read paths). Known examples:
- **`.skip(N)` / `$skip: N`** β€” must fetch every skipped doc for the orphan check; measured ~60s on `Act2Runs` at skip=100k. Suggest `readConcern: 'available'` or cursor pagination using the sort key (e.g. `startedAt < lastStartedAt` for a `sort: { startedAt: -1 }` cursor).
- **Count operations** β€” `ShardAwareCollection.approximateCountDocuments()` wraps `readConcern: 'available'` already, so **don't flag it**. Plain `.countDocuments()` on the underlying `Collection` (or via `.rawCollection`) on a sharded collection does *not* β€” flag those.
- **`aggregate` stages that traverse a wide index range without narrowing down** (`$count`, `$group` over many keys) β€” likely same problem.

Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate.
9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: '<index name>'` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟑 medium.
10. **Read without `projection`.** `find()`, `findOne()`, and `findOneAnd*()` calls that omit a `projection` (either in the options object or via a chained `.project()`) return every field of every matched document β€” wasted IO, network bandwidth, and BSON parse time. Documents in `Users`, `Act2Runs`, etc. are easily 10–100 KB, and most callers only need a handful of fields. Flag any added/modified read that lacks a projection; suggest the smallest projection covering the fields actually consumed downstream (and just `{ _id: 1 }` for existence checks). Skip writes (`updateOne`, `deleteOne`, `bulkWrite`) and `aggregate` (which shapes via `$project` and is often a full transformation anyway). 🟑 medium.

## Concrete examples

- **OK βœ…** β€” `Act2Runs.find({ userId, status, removedAt: null }).sort({ startedAt: -1 })` against index `{ userId: 1, status: 1, startedAt: -1 }` with `partialFilterExpression: { removedAt: null }`. Prefix matches, sort follows equality, partial filter matches.
- **🟠 high** β€” same query without `removedAt: null`. Partial-filter index no longer applies. Name the dropped key in the comment.
- **🟠 high** β€” `.sort({ 'profile.name': 1 })` added to a query whose chosen index is `{ userId: 1, finishedAt: 1 }`. In-memory sort.
- **🟠 high** β€” `Act2Runs.find({ status })` on a collection sharded by `userId`. Fans out to every shard.
- **🟠 high** β€” `Act2Runs.find({ userId, actId, removedAt: null }, { sort: { startedAt: -1 }, limit: 100, skip: 100_000 })` on a sharded collection without `readConcern: 'available'`. The `SHARDING_FILTER` stage fetches and orphan-checks the 100k skipped docs (~60s observed). Pass `readConcern: 'available'`, or refactor to cursor pagination using `startedAt < lastStartedAt`.
- **🟑 medium** β€” `$or` with one indexed branch and one unindexed branch β€” flag the unindexed branch.
- **🟑 medium** β€” `Users.findOne({ _id: id })` with no projection. Returns the full ~50 KB user document when the caller only reads `.username`. Suggest `{ projection: { username: 1 } }` (or `{ _id: 1 }` if it's an existence check).

## Severity classification

Expand All @@ -69,11 +79,14 @@ Apply this ESR-aware (Equality β†’ Sort β†’ Range) rubric. Pick the highest appl
- **Partial filter expression** of the matching index is incompatible with the query (e.g. index has `partialFilterExpression: { removedAt: null }` but the query also wants `removedAt: { $exists: true }`).
- **Sort can't use the index** (sort field absent from the index, or appears before equality fields).
- Query uses **`$regex` without an anchor** (`/^…/`) on an indexed field β€” the regex still won't use the index without an anchor.
- **`.skip(N)` / `$skip` on a sharded collection without `readConcern: 'available'`** β€” `SHARDING_FILTER` orphan-check dominates query time for large skips.
- 🟑 **medium** β€” An index exists and is used, but is **likely inefficient**:
- Filter is on a **low-selectivity** field only (e.g. only on a boolean or status enum that covers most documents) and there's no further filter to narrow it.
- **Read/return ratio likely poor**: index scans many docs the query then discards (e.g. range-then-equality compound order, or `$ne` / `$nin` on an indexed field).
- Sort uses the index but the direction doesn't match (compound index would need a reverse scan).
- Multiple `$or` branches and at least one has no usable index.
- Query on a collection with **many overlapping indexes** lacks `hint:` β€” planner spends tens of seconds choosing a plan.
- Read returns the full document because `projection` was omitted β€” wastes IO and BSON parse time for collections with multi-KB documents.
- 🟒 **low** β€” Stylistic / advisory:
- Could tighten an existing `partialFilterExpression` to reduce index size.
- Project to fewer fields to make this a covered query.
Expand All @@ -87,7 +100,8 @@ Follow these steps in order:

- Read `$CHANGED_FILES_PATH` (a JSON array of file paths). Use `Read` or `cat`.
- For each file, fetch its diff. Prefer the GitHub MCP server (e.g. `mcp__github__pull_request_read` with `get_diff` / `get_files`, passing owner+repo from `$REPO` and `pull_number: $PR_NUMBER`). If that tool isn't available, fall back to `gh pr diff $PR_NUMBER --repo $REPO -- <path>` via Bash. Focus on diff hunks that contain at least one added line β€” in those hunks, the after-state (added lines plus their unchanged context) is the surface to analyze; ignore hunks that contain only deletions.
- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/<collection-snake-case>.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find` on `$MONGO_INDEXES_DIR/*`. Don't read anything outside `$MONGO_INDEXES_DIR`.
- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/<collection-snake-case>.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find`.
- When you need to confirm whether a collection is sharded (or on which keys), grep the workspace for `new ShardAwareCollection<` and read off the shard-key array from the matching usage.

### 2. Decide findings

Expand All @@ -104,7 +118,18 @@ Guardrails:
- Skip silently if the collection file doesn't exist in `$MONGO_INDEXES_DIR`, or you can't determine the collection reliably.
- No findings β†’ write `none` to `$RESULT_PATH` and exit silently.

### 3. Post the review (only when there is at least one finding)
### 3. Deduplicate against previous runs

A previous run on this PR may have already posted some of these findings. Filter duplicates before posting:

1. Call `mcp__github__pull_request_read` with `method: get_review_comments` and `pullNumber: $PR_NUMBER`. The response is a list of review threads; each has `is_outdated` and a `comments` array (every comment has `author`, `body`, `path`, `line`).
2. Collect comments whose `author` is `github-actions`, whose containing thread is not `is_outdated`, and whose `body` starts with the template prefix defined in step 4 (any of the four severity emojis followed by ` **MongoDB index check`).
3. Drop any finding whose `(path, line)` matches a collected comment. Outdated threads don't count β€” line numbers shift across commits, so a fresh finding on a shifted line is a new finding.
4. If no new findings remain, compute the max severity from the original (un-filtered) findings list, write it to `$RESULT_PATH`, and exit silently. Do not submit an empty review.

If `get_review_comments` errors, proceed without dedup β€” a duplicate review is better than no review.

### 4. Post the review (only when at least one new finding remains)

Use whichever GitHub MCP review tools are available. Common shapes:

Expand All @@ -120,7 +145,7 @@ For each finding, point `path` and `line` at the **after-state RIGHT line** of t

{{1–3 sentences describing the issue. Reference the specific filter fields and the specific existing index by its `fields + name` if it has a name. Be concrete.}}

**Suggested action**: {{Either point to an existing index that should be used and what to adjust in the query, or recommend adding a new index in `src/packages/mongo-indexes/src/<file>.ts`. Be specific about which fields and partial-filter expressions.}}
**Suggested action**: {{Prefer fixes that don't add new indexes β€” busy collections (`Users`, `Act2Runs`, …) already have dozens and the explicit goal is to shrink that surface, not grow it. In order of preference: (1) adjust the query to fit an existing index β€” add the missing prefix field, include the `partialFilterExpression` qualifier, drop the offending sort, switch to cursor pagination, add a `projection`, etc.; name the index by `name:` if it has one, else by key spec. (2) Extend or replace an existing index so it covers both this query and its current callers (e.g. add one field to a compound) β€” name the index to modify and the new shape. (3) Only as a last resort, recommend a new index in `src/packages/mongo-indexes/src/<file>.ts`, and explicitly say why (1) and (2) don't work.}}

Where `{{SEVERITY_EMOJI}}` / `{{SEVERITY_WORD}}` is one of `πŸ”΄ critical`, `🟠 high`, `🟑 medium`, `🟒 low`.

Expand All @@ -131,9 +156,9 @@ Decide the review event:

When submitting, include a brief summary body β€” at most 4 short bullets covering: total findings count, breakdown by severity, and any cross-cutting recommendation (e.g. "Consider adding a compound index on `{userId, actorTaskId}` for these three queries"). End the summary body with the line `cc @mtrunkat` so they get notified of every review with findings.

### 4. Persist the result
### 5. Persist the result

After submitting the review (or after deciding no review is needed), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. **Use the `Write` tool** β€” bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail.
After submitting the review (or after skipping submission per step 3), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. The severity must reflect **all findings you identified in step 2**, including ones dropped as duplicates β€” re-running the workflow must never silently green a previously-failing check. **Use the `Write` tool** β€” bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail.

## Bash sandbox notes

Expand Down
Loading