diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md index a6ebc06..3fbfa31 100644 --- a/mongodb-query-index-check/prompts/review.md +++ b/mongodb-query-index-check/prompts/review.md @@ -45,11 +45,19 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag 1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` → `disabledAt: { $ne: null }`) makes the index unusable. 2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort. -3. **Shard-key prefix dropped.** Sharded collections in `$MONGO_INDEXES_DIR` are marked with `// For sharding ...` next to a unique index whose first field is the shard key (often `userId`). Dropping it fans out to every shard. +3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class in `apify/apify-core`'s `mongo-connection` package — its constructor takes the shard keys as the second positional argument: `new ShardAwareCollection(collection, shardKeys, openTelemetry?)`, used like `new ShardAwareCollection(this.Act2Runs, ['userId', '_id'], openTelemetry)` — meaning `Act2Runs` is sharded on `[userId, _id]`. To find shard keys for a collection, grep the workspace for `new ShardAwareCollection<` and read off the array literal at the matching usage. Secondary signal: a `// For sharding ` comment in the matching `$MONGO_INDEXES_DIR/.ts`. When neither source confirms sharding, prefer not to flag. 4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index. 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields. 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index. 7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index. +8. **Sharded-collection operation that should pass `readConcern: 'available'`.** Sharded chunks can contain orphan documents (residual from chunk migrations). For operations that would normally be served from the index alone, MongoDB has to fetch every candidate document and run a `SHARDING_FILTER` stage to drop orphans — expensive at scale. Passing `readConcern: 'available'` bypasses that filter at the cost of possibly including a few orphans (fine for most read paths). Known examples: + - **`.skip(N)` / `$skip: N`** — must fetch every skipped doc for the orphan check; measured ~60s on `Act2Runs` at skip=100k. Suggest `readConcern: 'available'` or cursor pagination using the sort key (e.g. `startedAt < lastStartedAt` for a `sort: { startedAt: -1 }` cursor). + - **Count operations** — `ShardAwareCollection.approximateCountDocuments()` wraps `readConcern: 'available'` already, so **don't flag it**. Plain `.countDocuments()` on the underlying `Collection` (or via `.rawCollection`) on a sharded collection does *not* — flag those. + - **`aggregate` stages that traverse a wide index range without narrowing down** (`$count`, `$group` over many keys) — likely same problem. + + Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate. +9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: ''` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟡 medium. +10. **Read without `projection`.** `find()`, `findOne()`, and `findOneAnd*()` calls that omit a `projection` (either in the options object or via a chained `.project()`) return every field of every matched document — wasted IO, network bandwidth, and BSON parse time. Documents in `Users`, `Act2Runs`, etc. are easily 10–100 KB, and most callers only need a handful of fields. Flag any added/modified read that lacks a projection; suggest the smallest projection covering the fields actually consumed downstream (and just `{ _id: 1 }` for existence checks). Skip writes (`updateOne`, `deleteOne`, `bulkWrite`) and `aggregate` (which shapes via `$project` and is often a full transformation anyway). 🟡 medium. ## Concrete examples @@ -57,7 +65,9 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag - **🟠 high** — same query without `removedAt: null`. Partial-filter index no longer applies. Name the dropped key in the comment. - **🟠 high** — `.sort({ 'profile.name': 1 })` added to a query whose chosen index is `{ userId: 1, finishedAt: 1 }`. In-memory sort. - **🟠 high** — `Act2Runs.find({ status })` on a collection sharded by `userId`. Fans out to every shard. +- **🟠 high** — `Act2Runs.find({ userId, actId, removedAt: null }, { sort: { startedAt: -1 }, limit: 100, skip: 100_000 })` on a sharded collection without `readConcern: 'available'`. The `SHARDING_FILTER` stage fetches and orphan-checks the 100k skipped docs (~60s observed). Pass `readConcern: 'available'`, or refactor to cursor pagination using `startedAt < lastStartedAt`. - **🟡 medium** — `$or` with one indexed branch and one unindexed branch — flag the unindexed branch. +- **🟡 medium** — `Users.findOne({ _id: id })` with no projection. Returns the full ~50 KB user document when the caller only reads `.username`. Suggest `{ projection: { username: 1 } }` (or `{ _id: 1 }` if it's an existence check). ## Severity classification @@ -69,11 +79,14 @@ Apply this ESR-aware (Equality → Sort → Range) rubric. Pick the highest appl - **Partial filter expression** of the matching index is incompatible with the query (e.g. index has `partialFilterExpression: { removedAt: null }` but the query also wants `removedAt: { $exists: true }`). - **Sort can't use the index** (sort field absent from the index, or appears before equality fields). - Query uses **`$regex` without an anchor** (`/^…/`) on an indexed field — the regex still won't use the index without an anchor. + - **`.skip(N)` / `$skip` on a sharded collection without `readConcern: 'available'`** — `SHARDING_FILTER` orphan-check dominates query time for large skips. - 🟡 **medium** — An index exists and is used, but is **likely inefficient**: - Filter is on a **low-selectivity** field only (e.g. only on a boolean or status enum that covers most documents) and there's no further filter to narrow it. - **Read/return ratio likely poor**: index scans many docs the query then discards (e.g. range-then-equality compound order, or `$ne` / `$nin` on an indexed field). - Sort uses the index but the direction doesn't match (compound index would need a reverse scan). - Multiple `$or` branches and at least one has no usable index. + - Query on a collection with **many overlapping indexes** lacks `hint:` — planner spends tens of seconds choosing a plan. + - Read returns the full document because `projection` was omitted — wastes IO and BSON parse time for collections with multi-KB documents. - 🟢 **low** — Stylistic / advisory: - Could tighten an existing `partialFilterExpression` to reduce index size. - Project to fewer fields to make this a covered query. @@ -87,7 +100,8 @@ Follow these steps in order: - Read `$CHANGED_FILES_PATH` (a JSON array of file paths). Use `Read` or `cat`. - For each file, fetch its diff. Prefer the GitHub MCP server (e.g. `mcp__github__pull_request_read` with `get_diff` / `get_files`, passing owner+repo from `$REPO` and `pull_number: $PR_NUMBER`). If that tool isn't available, fall back to `gh pr diff $PR_NUMBER --repo $REPO -- ` via Bash. Focus on diff hunks that contain at least one added line — in those hunks, the after-state (added lines plus their unchanged context) is the surface to analyze; ignore hunks that contain only deletions. -- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find` on `$MONGO_INDEXES_DIR/*`. Don't read anything outside `$MONGO_INDEXES_DIR`. +- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find`. +- When you need to confirm whether a collection is sharded (or on which keys), grep the workspace for `new ShardAwareCollection<` and read off the shard-key array from the matching usage. ### 2. Decide findings @@ -104,7 +118,18 @@ Guardrails: - Skip silently if the collection file doesn't exist in `$MONGO_INDEXES_DIR`, or you can't determine the collection reliably. - No findings → write `none` to `$RESULT_PATH` and exit silently. -### 3. Post the review (only when there is at least one finding) +### 3. Deduplicate against previous runs + +A previous run on this PR may have already posted some of these findings. Filter duplicates before posting: + +1. Call `mcp__github__pull_request_read` with `method: get_review_comments` and `pullNumber: $PR_NUMBER`. The response is a list of review threads; each has `is_outdated` and a `comments` array (every comment has `author`, `body`, `path`, `line`). +2. Collect comments whose `author` is `github-actions`, whose containing thread is not `is_outdated`, and whose `body` starts with the template prefix defined in step 4 (any of the four severity emojis followed by ` **MongoDB index check`). +3. Drop any finding whose `(path, line)` matches a collected comment. Outdated threads don't count — line numbers shift across commits, so a fresh finding on a shifted line is a new finding. +4. If no new findings remain, compute the max severity from the original (un-filtered) findings list, write it to `$RESULT_PATH`, and exit silently. Do not submit an empty review. + +If `get_review_comments` errors, proceed without dedup — a duplicate review is better than no review. + +### 4. Post the review (only when at least one new finding remains) Use whichever GitHub MCP review tools are available. Common shapes: @@ -120,7 +145,7 @@ For each finding, point `path` and `line` at the **after-state RIGHT line** of t {{1–3 sentences describing the issue. Reference the specific filter fields and the specific existing index by its `fields + name` if it has a name. Be concrete.}} - **Suggested action**: {{Either point to an existing index that should be used and what to adjust in the query, or recommend adding a new index in `src/packages/mongo-indexes/src/.ts`. Be specific about which fields and partial-filter expressions.}} + **Suggested action**: {{Prefer fixes that don't add new indexes — busy collections (`Users`, `Act2Runs`, …) already have dozens and the explicit goal is to shrink that surface, not grow it. In order of preference: (1) adjust the query to fit an existing index — add the missing prefix field, include the `partialFilterExpression` qualifier, drop the offending sort, switch to cursor pagination, add a `projection`, etc.; name the index by `name:` if it has one, else by key spec. (2) Extend or replace an existing index so it covers both this query and its current callers (e.g. add one field to a compound) — name the index to modify and the new shape. (3) Only as a last resort, recommend a new index in `src/packages/mongo-indexes/src/.ts`, and explicitly say why (1) and (2) don't work.}} Where `{{SEVERITY_EMOJI}}` / `{{SEVERITY_WORD}}` is one of `🔴 critical`, `🟠 high`, `🟡 medium`, `🟢 low`. @@ -131,9 +156,9 @@ Decide the review event: When submitting, include a brief summary body — at most 4 short bullets covering: total findings count, breakdown by severity, and any cross-cutting recommendation (e.g. "Consider adding a compound index on `{userId, actorTaskId}` for these three queries"). End the summary body with the line `cc @mtrunkat` so they get notified of every review with findings. -### 4. Persist the result +### 5. Persist the result -After submitting the review (or after deciding no review is needed), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. **Use the `Write` tool** — bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail. +After submitting the review (or after skipping submission per step 3), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. The severity must reflect **all findings you identified in step 2**, including ones dropped as duplicates — re-running the workflow must never silently green a previously-failing check. **Use the `Write` tool** — bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail. ## Bash sandbox notes