From dcf9faf5afdc48a0ca702f02782ce4bf9837bbbe Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 18 May 2026 13:17:48 +0000 Subject: [PATCH 1/6] feat: teach the prompt to flag sharded-skip and missing-hint queries MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two new patterns added to the action's prompt, sourced from recent internal post-mortems on apify-core slow queries: 1. **`.skip(N)` / `$skip` on a sharded collection without `readConcern: 'available'`.** MongoDB runs a `SHARDING_FILTER` stage that fetches every skipped document and checks it against the shard's chunk ranges to filter orphans. For large skips this dominates the query — measured ~60s on `Act2Runs` with skip=100k. Fix: pass `readConcern: 'available'` (or refactor to cursor pagination with `$gt: lastId`). Applies to `find().skip()`, `find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`. `countDocuments` is auto-handled by `ShardAwareCollection` and should not be flagged. 2. **Query on a heavy-index collection without `hint`.** When the collection's `mongo-indexes` file has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds choosing a plan. Recommend `hint: ''` to pin selection. 🟡 medium. Includes one concrete example for #1 (the 100k-skip on Act2Runs case from the post-mortem). The severity rubric gains corresponding entries under 🟠 high (sharded-skip) and 🟡 medium (missing hint). --- mongodb-query-index-check/prompts/review.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md index a6ebc06..310080b 100644 --- a/mongodb-query-index-check/prompts/review.md +++ b/mongodb-query-index-check/prompts/review.md @@ -50,6 +50,8 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields. 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index. 7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index. +8. **`.skip(N)` / `$skip: N` on a sharded collection without `readConcern: 'available'`.** Sharded reads run a `SHARDING_FILTER` stage that fetches every skipped document and checks it against the shard's chunk ranges to filter orphans. For large skips this dominates the query (a 100k skip on `Act2Runs` was measured at ~60s in production). Fix: pass `readConcern: 'available'` in the call options (or refactor to cursor pagination with `$gt: lastId`). Applies to `find().skip()`, `find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`. **Do not flag `countDocuments`** — it's auto-handled by `ShardAwareCollection`. +9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: ''` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟡 medium. ## Concrete examples @@ -57,6 +59,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag - **🟠 high** — same query without `removedAt: null`. Partial-filter index no longer applies. Name the dropped key in the comment. - **🟠 high** — `.sort({ 'profile.name': 1 })` added to a query whose chosen index is `{ userId: 1, finishedAt: 1 }`. In-memory sort. - **🟠 high** — `Act2Runs.find({ status })` on a collection sharded by `userId`. Fans out to every shard. +- **🟠 high** — `Act2Runs.find({ userId, actId, removedAt: null }, { sort: { startedAt: -1 }, limit: 100, skip: 100_000 })` on a sharded collection without `readConcern: 'available'`. The `SHARDING_FILTER` stage fetches and orphan-checks the 100k skipped docs (~60s observed). Pass `readConcern: 'available'`, or refactor to cursor pagination using `startedAt < lastStartedAt`. - **🟡 medium** — `$or` with one indexed branch and one unindexed branch — flag the unindexed branch. ## Severity classification @@ -69,11 +72,13 @@ Apply this ESR-aware (Equality → Sort → Range) rubric. Pick the highest appl - **Partial filter expression** of the matching index is incompatible with the query (e.g. index has `partialFilterExpression: { removedAt: null }` but the query also wants `removedAt: { $exists: true }`). - **Sort can't use the index** (sort field absent from the index, or appears before equality fields). - Query uses **`$regex` without an anchor** (`/^…/`) on an indexed field — the regex still won't use the index without an anchor. + - **`.skip(N)` / `$skip` on a sharded collection without `readConcern: 'available'`** — `SHARDING_FILTER` orphan-check dominates query time for large skips. - 🟡 **medium** — An index exists and is used, but is **likely inefficient**: - Filter is on a **low-selectivity** field only (e.g. only on a boolean or status enum that covers most documents) and there's no further filter to narrow it. - **Read/return ratio likely poor**: index scans many docs the query then discards (e.g. range-then-equality compound order, or `$ne` / `$nin` on an indexed field). - Sort uses the index but the direction doesn't match (compound index would need a reverse scan). - Multiple `$or` branches and at least one has no usable index. + - Query on a collection with **many overlapping indexes** lacks `hint:` — planner spends tens of seconds choosing a plan. - 🟢 **low** — Stylistic / advisory: - Could tighten an existing `partialFilterExpression` to reduce index size. - Project to fewer fields to make this a covered query. From 0ce7b14dc2571b7ee9c561b1c8cd1f63359b52dd Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 18 May 2026 13:59:05 +0000 Subject: [PATCH 2/6] review: address mvolfik feedback on sharded-collection patterns MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two prompt rewrites prompted by inline review on #17: * **Pattern #3 (shard-key dropped).** Old wording read ambiguously as "flag when somebody drops the unique index" rather than "flag when a query omits the shard-key field". Rewritten to make the intent explicit (drop = field missing from filter / `$match`), and to name the canonical source for the shard key — the `ShardAwareCollection(, { shardKey: '', … })` constructor, which lives outside `$MONGO_INDEXES_DIR`. The `// For sharding …` index comment is now described as a signal rather than the source. Adds the explicit caveat that some sharded collections (e.g. `Datasets`, sharded on `_id`) won't have that signal at all and should be left un-flagged when the status is uncertain. * **Pattern #8 (sharded skip / readConcern).** Rewritten from "match these specific calls" to "here's the underlying mechanism, here are known examples, reason about new cases too": sharded chunks can hold orphan docs → `SHARDING_FILTER` stage fetches every candidate to drop them → `readConcern: 'available'` bypasses it. The named examples (skip, countDocuments-auto-handled, wide aggregation ranges) now sit under that explanation rather than being the rule itself, which lets the model spot novel index-only-on-shard regressions instead of just the patterns we already enumerated. Cursor-pagination guidance now consistently uses the sort key (`startedAt < lastStartedAt`), matching the concrete example below. --- mongodb-query-index-check/prompts/review.md | 9 +++++++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md index 310080b..04fc83e 100644 --- a/mongodb-query-index-check/prompts/review.md +++ b/mongodb-query-index-check/prompts/review.md @@ -45,12 +45,17 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag 1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` → `disabledAt: { $ne: null }`) makes the index unusable. 2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort. -3. **Shard-key prefix dropped.** Sharded collections in `$MONGO_INDEXES_DIR` are marked with `// For sharding ...` next to a unique index whose first field is the shard key (often `userId`). Dropping it fans out to every shard. +3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for the shard key is the `ShardAwareCollection(, { shardKey: '', ... })` constructor in the collection wrapper, which isn't visible from `$MONGO_INDEXES_DIR`. The reliable signal you *do* have: a `// For sharding ` comment in the index file, usually next to a unique index whose first field is the shard key. Some sharded collections (e.g. `Datasets`, sharded on `_id`) won't have that signal — when in doubt about sharding status, prefer not to flag. 4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index. 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields. 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index. 7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index. -8. **`.skip(N)` / `$skip: N` on a sharded collection without `readConcern: 'available'`.** Sharded reads run a `SHARDING_FILTER` stage that fetches every skipped document and checks it against the shard's chunk ranges to filter orphans. For large skips this dominates the query (a 100k skip on `Act2Runs` was measured at ~60s in production). Fix: pass `readConcern: 'available'` in the call options (or refactor to cursor pagination with `$gt: lastId`). Applies to `find().skip()`, `find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`. **Do not flag `countDocuments`** — it's auto-handled by `ShardAwareCollection`. +8. **Sharded-collection operation that should pass `readConcern: 'available'`.** Sharded chunks can contain orphan documents (residual from chunk migrations). For operations that would normally be served from the index alone, MongoDB has to fetch every candidate document and run a `SHARDING_FILTER` stage to drop orphans — expensive at scale. Passing `readConcern: 'available'` bypasses that filter at the cost of possibly including a few orphans (fine for most read paths). Known examples: + - **`.skip(N)` / `$skip: N`** — must fetch every skipped doc for the orphan check; measured ~60s on `Act2Runs` at skip=100k. Suggest `readConcern: 'available'` or cursor pagination using the sort key (e.g. `startedAt < lastStartedAt` for a `sort: { startedAt: -1 }` cursor). + - **`.countDocuments()`** — same problem, but **don't flag**: `ShardAwareCollection` already wraps it with `readConcern: 'available'`. + - **`aggregate` stages that traverse a wide index range without narrowing down** (`$count`, `$group` over many keys) — likely same problem. + + Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate. 9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: ''` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟡 medium. ## Concrete examples From 7479cd4ee1b874f59c082bab6e0792c513a63f71 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Marek=20Trunk=C3=A1t?= Date: Tue, 19 May 2026 10:49:01 +0200 Subject: [PATCH 3/6] Claude/fix mongodb duplicates x v56n (#18) Co-authored-by: Claude --- mongodb-query-index-check/prompts/review.md | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md index 04fc83e..96f13d0 100644 --- a/mongodb-query-index-check/prompts/review.md +++ b/mongodb-query-index-check/prompts/review.md @@ -114,7 +114,18 @@ Guardrails: - Skip silently if the collection file doesn't exist in `$MONGO_INDEXES_DIR`, or you can't determine the collection reliably. - No findings → write `none` to `$RESULT_PATH` and exit silently. -### 3. Post the review (only when there is at least one finding) +### 3. Deduplicate against previous runs + +A previous run on this PR may have already posted some of these findings. Filter duplicates before posting: + +1. Call `mcp__github__pull_request_read` with `method: get_review_comments` and `pullNumber: $PR_NUMBER`. The response is a list of review threads; each has `is_outdated` and a `comments` array (every comment has `author`, `body`, `path`, `line`). +2. Collect comments whose `author` is `github-actions`, whose containing thread is not `is_outdated`, and whose `body` starts with the template prefix defined in step 4 (any of the four severity emojis followed by ` **MongoDB index check`). +3. Drop any finding whose `(path, line)` matches a collected comment. Outdated threads don't count — line numbers shift across commits, so a fresh finding on a shifted line is a new finding. +4. If no new findings remain, compute the max severity from the original (un-filtered) findings list, write it to `$RESULT_PATH`, and exit silently. Do not submit an empty review. + +If `get_review_comments` errors, proceed without dedup — a duplicate review is better than no review. + +### 4. Post the review (only when at least one new finding remains) Use whichever GitHub MCP review tools are available. Common shapes: @@ -141,9 +152,9 @@ Decide the review event: When submitting, include a brief summary body — at most 4 short bullets covering: total findings count, breakdown by severity, and any cross-cutting recommendation (e.g. "Consider adding a compound index on `{userId, actorTaskId}` for these three queries"). End the summary body with the line `cc @mtrunkat` so they get notified of every review with findings. -### 4. Persist the result +### 5. Persist the result -After submitting the review (or after deciding no review is needed), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. **Use the `Write` tool** — bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail. +After submitting the review (or after skipping submission per step 3), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. The severity must reflect **all findings you identified in step 2**, including ones dropped as duplicates — re-running the workflow must never silently green a previously-failing check. **Use the `Write` tool** — bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail. ## Bash sandbox notes From e00ffe59eb341df80654f3ee5acf9458d627c9cb Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 19 May 2026 08:55:36 +0000 Subject: [PATCH 4/6] feat: flag missing projection + prefer reusing indexes over adding new ones MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two additions sourced from team feedback while reviewing this PR: * **Missing projection pattern** (apify/apify-core#19762). Adds a new "Read without projection" entry to the patterns checklist, a 🟡 medium severity bullet, and a Users.findOne example. Flags find/findOne/findOneAnd* reads that omit projection and waste IO + BSON parse time on multi-KB documents. Skips writes and aggregate (which shapes via $project). * **Prefer reusing existing indexes.** From the Slack thread on this PR — busy collections already have dozens of indexes and the goal is to shrink that surface, not grow it. Rewrites the "Suggested action" template in step 4 with a 3-tier preference: (1) adjust the query to fit an existing index, (2) extend or replace an existing index, (3) only as a last resort recommend a new one, with explicit reasoning why (1) and (2) don't work. --- mongodb-query-index-check/prompts/review.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md index 96f13d0..c8d12a1 100644 --- a/mongodb-query-index-check/prompts/review.md +++ b/mongodb-query-index-check/prompts/review.md @@ -57,6 +57,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate. 9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: ''` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟡 medium. +10. **Read without `projection`.** `find()`, `findOne()`, and `findOneAnd*()` calls that omit a `projection` (either in the options object or via a chained `.project()`) return every field of every matched document — wasted IO, network bandwidth, and BSON parse time. Documents in `Users`, `Act2Runs`, etc. are easily 10–100 KB, and most callers only need a handful of fields. Flag any added/modified read that lacks a projection; suggest the smallest projection covering the fields actually consumed downstream (and just `{ _id: 1 }` for existence checks). Skip writes (`updateOne`, `deleteOne`, `bulkWrite`) and `aggregate` (which shapes via `$project` and is often a full transformation anyway). 🟡 medium. ## Concrete examples @@ -66,6 +67,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag - **🟠 high** — `Act2Runs.find({ status })` on a collection sharded by `userId`. Fans out to every shard. - **🟠 high** — `Act2Runs.find({ userId, actId, removedAt: null }, { sort: { startedAt: -1 }, limit: 100, skip: 100_000 })` on a sharded collection without `readConcern: 'available'`. The `SHARDING_FILTER` stage fetches and orphan-checks the 100k skipped docs (~60s observed). Pass `readConcern: 'available'`, or refactor to cursor pagination using `startedAt < lastStartedAt`. - **🟡 medium** — `$or` with one indexed branch and one unindexed branch — flag the unindexed branch. +- **🟡 medium** — `Users.findOne({ _id: id })` with no projection. Returns the full ~50 KB user document when the caller only reads `.username`. Suggest `{ projection: { username: 1 } }` (or `{ _id: 1 }` if it's an existence check). ## Severity classification @@ -84,6 +86,7 @@ Apply this ESR-aware (Equality → Sort → Range) rubric. Pick the highest appl - Sort uses the index but the direction doesn't match (compound index would need a reverse scan). - Multiple `$or` branches and at least one has no usable index. - Query on a collection with **many overlapping indexes** lacks `hint:` — planner spends tens of seconds choosing a plan. + - Read returns the full document because `projection` was omitted — wastes IO and BSON parse time for collections with multi-KB documents. - 🟢 **low** — Stylistic / advisory: - Could tighten an existing `partialFilterExpression` to reduce index size. - Project to fewer fields to make this a covered query. @@ -141,7 +144,7 @@ For each finding, point `path` and `line` at the **after-state RIGHT line** of t {{1–3 sentences describing the issue. Reference the specific filter fields and the specific existing index by its `fields + name` if it has a name. Be concrete.}} - **Suggested action**: {{Either point to an existing index that should be used and what to adjust in the query, or recommend adding a new index in `src/packages/mongo-indexes/src/.ts`. Be specific about which fields and partial-filter expressions.}} + **Suggested action**: {{Prefer fixes that don't add new indexes — busy collections (`Users`, `Act2Runs`, …) already have dozens and the explicit goal is to shrink that surface, not grow it. In order of preference: (1) adjust the query to fit an existing index — add the missing prefix field, include the `partialFilterExpression` qualifier, drop the offending sort, switch to cursor pagination, add a `projection`, etc.; name the index by `name:` if it has one, else by key spec. (2) Extend or replace an existing index so it covers both this query and its current callers (e.g. add one field to a compound) — name the index to modify and the new shape. (3) Only as a last resort, recommend a new index in `src/packages/mongo-indexes/src/.ts`, and explicitly say why (1) and (2) don't work.}} Where `{{SEVERITY_EMOJI}}` / `{{SEVERITY_WORD}}` is one of `🔴 critical`, `🟠 high`, `🟡 medium`, `🟢 low`. From 2e0556165c06dd2dea88055d067958c5beff9c57 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 19 May 2026 09:16:59 +0000 Subject: [PATCH 5/6] review: fix ShardAwareCollection refs (mvolfik feedback) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Real constructor signature: `new ShardAwareCollection(collection, shardKeys[], openTelemetry?)`, used like `new ShardAwareCollection(this.Act2Runs, ['userId', '_id'], openTelemetry)`. Old version was hallucinated. * Extend sparse-checkout to also fetch `src/packages/mongo-connection/src/shard_aware_collection` so Claude can `Read` the file directly. New `$SHARD_AWARE_COLLECTION_DIR` env var surfaced in the prompt; empty when the file isn't present on the runner. * Drop the "don't flag `countDocuments`" exception — `ShardAwareCollection` actually exposes `approximateCountDocuments` (which wraps `readConcern: 'available'`), not `countDocuments`. Plain `.countDocuments()` on a sharded raw collection is suspect and should be flagged. * Step 1: loosen the "don't read outside `$MONGO_INDEXES_DIR`" rule to allow reading `$SHARD_AWARE_COLLECTION_DIR` and grepping the workspace for `new ShardAwareCollection<`. --- mongodb-query-index-check/action.yaml | 20 ++++++++++++++++---- mongodb-query-index-check/prompts/review.md | 8 +++++--- 2 files changed, 21 insertions(+), 7 deletions(-) diff --git a/mongodb-query-index-check/action.yaml b/mongodb-query-index-check/action.yaml index ca11485..6a32ba8 100644 --- a/mongodb-query-index-check/action.yaml +++ b/mongodb-query-index-check/action.yaml @@ -107,7 +107,9 @@ runs: ref: develop token: ${{ inputs.apify-core-token }} path: __mongo_index_check_apify_core - sparse-checkout: src/packages/mongo-indexes/src + sparse-checkout: | + src/packages/mongo-indexes/src + src/packages/mongo-connection/src/shard_aware_collection sparse-checkout-cone-mode: false fetch-depth: 1 @@ -119,12 +121,14 @@ runs: run: | set -euo pipefail if [ -n "$APIFY_CORE_TOKEN" ]; then - indexes_dir="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core/src/packages/mongo-indexes/src" + repo_root="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core" origin_label="apify-core@develop" else - indexes_dir="${GITHUB_WORKSPACE}/src/packages/mongo-indexes/src" + repo_root="${GITHUB_WORKSPACE}" origin_label="local workspace (assuming caller is apify-core)" fi + indexes_dir="${repo_root}/src/packages/mongo-indexes/src" + shard_aware_dir="${repo_root}/src/packages/mongo-connection/src/shard_aware_collection" if [ ! -d "$indexes_dir" ]; then echo "::error::Could not find mongo-indexes source directory at: $indexes_dir" exit 1 @@ -132,6 +136,14 @@ runs: file_count=$(find "$indexes_dir" -maxdepth 2 -type f -name '*.ts' | wc -l) echo "Resolved mongo-indexes from ${origin_label}: ${indexes_dir} (${file_count} .ts file(s))." echo "MONGO_INDEXES_DIR=${indexes_dir}" >> "$GITHUB_ENV" + # Optional: surface the ShardAwareCollection source so Claude can verify shard keys directly. + # Skipped silently if absent (older apify-core ref, partial sparse checkout, etc.). + if [ -d "$shard_aware_dir" ]; then + echo "Resolved ShardAwareCollection sources at ${shard_aware_dir}." + echo "SHARD_AWARE_COLLECTION_DIR=${shard_aware_dir}" >> "$GITHUB_ENV" + else + echo "SHARD_AWARE_COLLECTION_DIR=" >> "$GITHUB_ENV" + fi - name: Render Claude prompt id: render @@ -146,7 +158,7 @@ runs: run: | set -euo pipefail prompt_file="${RUNNER_TEMP}/mongo-index-prompt.md" - envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \ + envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $SHARD_AWARE_COLLECTION_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \ < "${GITHUB_ACTION_PATH}/prompts/review.md" \ > "$prompt_file" delimiter="EOF_${RANDOM}${RANDOM}${RANDOM}" diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md index c8d12a1..8a240ff 100644 --- a/mongodb-query-index-check/prompts/review.md +++ b/mongodb-query-index-check/prompts/review.md @@ -16,6 +16,7 @@ You are NOT here to do a general code review. Stay focused on indexes. - Files inside it are TypeScript sources, one per collection (e.g. `users.ts`, `actor_jobs.ts`, `request_queues.ts`). - Each file calls `await ensureIndex(Collection, { fields... }, { options... })`. Read these files to learn what indexes exist for each collection, including their `partialFilterExpression`, `unique`, `sparse`, `collation`, and `name` options. - The `index.ts` and `ensure_index.ts` files are infrastructure, not index definitions — skip them. +- **ShardAwareCollection sources** (canonical source for shard keys): `$SHARD_AWARE_COLLECTION_DIR` (may be empty if unavailable on this runner). When set, contains `shard_aware_collection.ts` and `filter_types.ts` from `apify/apify-core`'s `mongo-connection` package. Read the file when you need to confirm whether a collection is sharded and on which keys — `grep` for `new ShardAwareCollection<` in the apify-core source separately to find usages. - **Changed source files in this PR** (JSON array of paths, already filtered to source files outside test/build/vendor dirs): `$CHANGED_FILES_PATH` - **Request-changes mode**: `$REQUEST_CHANGES_MODE`. When `true`, the workflow will fail on any finding; when `false`, comments only. - **Result file** you must write the maximum severity to: `$RESULT_PATH`. Allowed contents: one of `none`, `low`, `medium`, `high`, `critical` — single word, no whitespace, no trailing newline. @@ -45,14 +46,14 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag 1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` → `disabledAt: { $ne: null }`) makes the index unusable. 2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort. -3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for the shard key is the `ShardAwareCollection(, { shardKey: '', ... })` constructor in the collection wrapper, which isn't visible from `$MONGO_INDEXES_DIR`. The reliable signal you *do* have: a `// For sharding ` comment in the index file, usually next to a unique index whose first field is the shard key. Some sharded collections (e.g. `Datasets`, sharded on `_id`) won't have that signal — when in doubt about sharding status, prefer not to flag. +3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class — its constructor signature is `new ShardAwareCollection(collection, shardKeys, openTelemetry?)`, called like `new ShardAwareCollection(this.Act2Runs, ['userId', '_id'], openTelemetry)` (see `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` when that env var is set). Grep `new ShardAwareCollection<` across the apify-core source to find usages and read off the `shardKeys` array; the secondary signal in `$MONGO_INDEXES_DIR` is a `// For sharding ` comment next to a unique index. When you can't confirm sharding from either source, prefer not to flag. 4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index. 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields. 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index. 7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index. 8. **Sharded-collection operation that should pass `readConcern: 'available'`.** Sharded chunks can contain orphan documents (residual from chunk migrations). For operations that would normally be served from the index alone, MongoDB has to fetch every candidate document and run a `SHARDING_FILTER` stage to drop orphans — expensive at scale. Passing `readConcern: 'available'` bypasses that filter at the cost of possibly including a few orphans (fine for most read paths). Known examples: - **`.skip(N)` / `$skip: N`** — must fetch every skipped doc for the orphan check; measured ~60s on `Act2Runs` at skip=100k. Suggest `readConcern: 'available'` or cursor pagination using the sort key (e.g. `startedAt < lastStartedAt` for a `sort: { startedAt: -1 }` cursor). - - **`.countDocuments()`** — same problem, but **don't flag**: `ShardAwareCollection` already wraps it with `readConcern: 'available'`. + - **Count operations** — `ShardAwareCollection.approximateCountDocuments()` wraps `readConcern: 'available'` already, so **don't flag it**. Plain `.countDocuments()` on the underlying `Collection` (or via `.rawCollection`) on a sharded collection does *not* — flag those. - **`aggregate` stages that traverse a wide index range without narrowing down** (`$count`, `$group` over many keys) — likely same problem. Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate. @@ -100,7 +101,8 @@ Follow these steps in order: - Read `$CHANGED_FILES_PATH` (a JSON array of file paths). Use `Read` or `cat`. - For each file, fetch its diff. Prefer the GitHub MCP server (e.g. `mcp__github__pull_request_read` with `get_diff` / `get_files`, passing owner+repo from `$REPO` and `pull_number: $PR_NUMBER`). If that tool isn't available, fall back to `gh pr diff $PR_NUMBER --repo $REPO -- ` via Bash. Focus on diff hunks that contain at least one added line — in those hunks, the after-state (added lines plus their unchanged context) is the surface to analyze; ignore hunks that contain only deletions. -- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find` on `$MONGO_INDEXES_DIR/*`. Don't read anything outside `$MONGO_INDEXES_DIR`. +- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find`. +- When you need to confirm whether a collection is sharded (or on which keys), `Read` `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` if `$SHARD_AWARE_COLLECTION_DIR` is non-empty, and grep the workspace for `new ShardAwareCollection<` to find collection-specific instantiations. ### 2. Decide findings From e7b8663e47e693c9d176147aea08d79e8f157d31 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 19 May 2026 09:21:58 +0000 Subject: [PATCH 6/6] review: drop the shard_aware_collection sparse-checkout MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Prior commit extended the sparse-checkout to pull `mongo-connection/src/shard_aware_collection`. That doesn't actually help — the file just defines the class. The *instantiations* (`new ShardAwareCollection(...)`) live at call sites scattered across apify-core, which Claude needs to grep for anyway. Revert the action.yaml additions (sparse-checkout entry, env var, envsubst placeholder) and the prompt's Input-data bullet. The Pattern #3 description keeps the constructor signature inline (it's stable enough to bake into the prompt) and step 1 still tells Claude to grep the workspace for `new ShardAwareCollection<` to find shard keys. --- mongodb-query-index-check/action.yaml | 20 ++++---------------- mongodb-query-index-check/prompts/review.md | 5 ++--- 2 files changed, 6 insertions(+), 19 deletions(-) diff --git a/mongodb-query-index-check/action.yaml b/mongodb-query-index-check/action.yaml index 6a32ba8..ca11485 100644 --- a/mongodb-query-index-check/action.yaml +++ b/mongodb-query-index-check/action.yaml @@ -107,9 +107,7 @@ runs: ref: develop token: ${{ inputs.apify-core-token }} path: __mongo_index_check_apify_core - sparse-checkout: | - src/packages/mongo-indexes/src - src/packages/mongo-connection/src/shard_aware_collection + sparse-checkout: src/packages/mongo-indexes/src sparse-checkout-cone-mode: false fetch-depth: 1 @@ -121,14 +119,12 @@ runs: run: | set -euo pipefail if [ -n "$APIFY_CORE_TOKEN" ]; then - repo_root="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core" + indexes_dir="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core/src/packages/mongo-indexes/src" origin_label="apify-core@develop" else - repo_root="${GITHUB_WORKSPACE}" + indexes_dir="${GITHUB_WORKSPACE}/src/packages/mongo-indexes/src" origin_label="local workspace (assuming caller is apify-core)" fi - indexes_dir="${repo_root}/src/packages/mongo-indexes/src" - shard_aware_dir="${repo_root}/src/packages/mongo-connection/src/shard_aware_collection" if [ ! -d "$indexes_dir" ]; then echo "::error::Could not find mongo-indexes source directory at: $indexes_dir" exit 1 @@ -136,14 +132,6 @@ runs: file_count=$(find "$indexes_dir" -maxdepth 2 -type f -name '*.ts' | wc -l) echo "Resolved mongo-indexes from ${origin_label}: ${indexes_dir} (${file_count} .ts file(s))." echo "MONGO_INDEXES_DIR=${indexes_dir}" >> "$GITHUB_ENV" - # Optional: surface the ShardAwareCollection source so Claude can verify shard keys directly. - # Skipped silently if absent (older apify-core ref, partial sparse checkout, etc.). - if [ -d "$shard_aware_dir" ]; then - echo "Resolved ShardAwareCollection sources at ${shard_aware_dir}." - echo "SHARD_AWARE_COLLECTION_DIR=${shard_aware_dir}" >> "$GITHUB_ENV" - else - echo "SHARD_AWARE_COLLECTION_DIR=" >> "$GITHUB_ENV" - fi - name: Render Claude prompt id: render @@ -158,7 +146,7 @@ runs: run: | set -euo pipefail prompt_file="${RUNNER_TEMP}/mongo-index-prompt.md" - envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $SHARD_AWARE_COLLECTION_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \ + envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \ < "${GITHUB_ACTION_PATH}/prompts/review.md" \ > "$prompt_file" delimiter="EOF_${RANDOM}${RANDOM}${RANDOM}" diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md index 8a240ff..3fbfa31 100644 --- a/mongodb-query-index-check/prompts/review.md +++ b/mongodb-query-index-check/prompts/review.md @@ -16,7 +16,6 @@ You are NOT here to do a general code review. Stay focused on indexes. - Files inside it are TypeScript sources, one per collection (e.g. `users.ts`, `actor_jobs.ts`, `request_queues.ts`). - Each file calls `await ensureIndex(Collection, { fields... }, { options... })`. Read these files to learn what indexes exist for each collection, including their `partialFilterExpression`, `unique`, `sparse`, `collation`, and `name` options. - The `index.ts` and `ensure_index.ts` files are infrastructure, not index definitions — skip them. -- **ShardAwareCollection sources** (canonical source for shard keys): `$SHARD_AWARE_COLLECTION_DIR` (may be empty if unavailable on this runner). When set, contains `shard_aware_collection.ts` and `filter_types.ts` from `apify/apify-core`'s `mongo-connection` package. Read the file when you need to confirm whether a collection is sharded and on which keys — `grep` for `new ShardAwareCollection<` in the apify-core source separately to find usages. - **Changed source files in this PR** (JSON array of paths, already filtered to source files outside test/build/vendor dirs): `$CHANGED_FILES_PATH` - **Request-changes mode**: `$REQUEST_CHANGES_MODE`. When `true`, the workflow will fail on any finding; when `false`, comments only. - **Result file** you must write the maximum severity to: `$RESULT_PATH`. Allowed contents: one of `none`, `low`, `medium`, `high`, `critical` — single word, no whitespace, no trailing newline. @@ -46,7 +45,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag 1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` → `disabledAt: { $ne: null }`) makes the index unusable. 2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort. -3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class — its constructor signature is `new ShardAwareCollection(collection, shardKeys, openTelemetry?)`, called like `new ShardAwareCollection(this.Act2Runs, ['userId', '_id'], openTelemetry)` (see `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` when that env var is set). Grep `new ShardAwareCollection<` across the apify-core source to find usages and read off the `shardKeys` array; the secondary signal in `$MONGO_INDEXES_DIR` is a `// For sharding ` comment next to a unique index. When you can't confirm sharding from either source, prefer not to flag. +3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class in `apify/apify-core`'s `mongo-connection` package — its constructor takes the shard keys as the second positional argument: `new ShardAwareCollection(collection, shardKeys, openTelemetry?)`, used like `new ShardAwareCollection(this.Act2Runs, ['userId', '_id'], openTelemetry)` — meaning `Act2Runs` is sharded on `[userId, _id]`. To find shard keys for a collection, grep the workspace for `new ShardAwareCollection<` and read off the array literal at the matching usage. Secondary signal: a `// For sharding ` comment in the matching `$MONGO_INDEXES_DIR/.ts`. When neither source confirms sharding, prefer not to flag. 4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index. 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields. 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index. @@ -102,7 +101,7 @@ Follow these steps in order: - Read `$CHANGED_FILES_PATH` (a JSON array of file paths). Use `Read` or `cat`. - For each file, fetch its diff. Prefer the GitHub MCP server (e.g. `mcp__github__pull_request_read` with `get_diff` / `get_files`, passing owner+repo from `$REPO` and `pull_number: $PR_NUMBER`). If that tool isn't available, fall back to `gh pr diff $PR_NUMBER --repo $REPO -- ` via Bash. Focus on diff hunks that contain at least one added line — in those hunks, the after-state (added lines plus their unchanged context) is the surface to analyze; ignore hunks that contain only deletions. - For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find`. -- When you need to confirm whether a collection is sharded (or on which keys), `Read` `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` if `$SHARD_AWARE_COLLECTION_DIR` is non-empty, and grep the workspace for `new ShardAwareCollection<` to find collection-specific instantiations. +- When you need to confirm whether a collection is sharded (or on which keys), grep the workspace for `new ShardAwareCollection<` and read off the shard-key array from the matching usage. ### 2. Decide findings