From dcf9faf5afdc48a0ca702f02782ce4bf9837bbbe Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Mon, 18 May 2026 13:17:48 +0000
Subject: [PATCH 1/6] feat: teach the prompt to flag sharded-skip and
 missing-hint queries
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two new patterns added to the action's prompt, sourced from recent
internal post-mortems on apify-core slow queries:

1. **`.skip(N)` / `$skip` on a sharded collection without
   `readConcern: 'available'`.** MongoDB runs a `SHARDING_FILTER` stage
   that fetches every skipped document and checks it against the
   shard's chunk ranges to filter orphans. For large skips this
   dominates the query — measured ~60s on `Act2Runs` with skip=100k.
   Fix: pass `readConcern: 'available'` (or refactor to cursor
   pagination with `$gt: lastId`). Applies to `find().skip()`,
   `find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`.
   `countDocuments` is auto-handled by `ShardAwareCollection` and
   should not be flagged.

2. **Query on a heavy-index collection without `hint`.** When the
   collection's `mongo-indexes` file has many `ensureIndex` calls
   (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can
   spend tens of seconds choosing a plan. Recommend
   `hint: '<index name>'` to pin selection. 🟡 medium.

Includes one concrete example for #1 (the 100k-skip on Act2Runs case
from the post-mortem). The severity rubric gains corresponding entries
under 🟠 high (sharded-skip) and 🟡 medium (missing hint).
---
 mongodb-query-index-check/prompts/review.md | 5 +++++
 1 file changed, 5 insertions(+)
diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md
index a6ebc06..310080b 100644
--- a/mongodb-query-index-check/prompts/review.md
+++ b/mongodb-query-index-check/prompts/review.md
@@ -50,6 +50,8 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag
 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields.
 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index.
 7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index.
+8. **`.skip(N)` / `$skip: N` on a sharded collection without `readConcern: 'available'`.** Sharded reads run a `SHARDING_FILTER` stage that fetches every skipped document and checks it against the shard's chunk ranges to filter orphans. For large skips this dominates the query (a 100k skip on `Act2Runs` was measured at ~60s in production). Fix: pass `readConcern: 'available'` in the call options (or refactor to cursor pagination with `$gt: lastId`). Applies to `find().skip()`, `find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`. **Do not flag `countDocuments`** — it's auto-handled by `ShardAwareCollection`.
+9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: '<index name>'` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟡 medium.
 
 ## Concrete examples
 
@@ -57,6 +59,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag
 - **🟠 high** — same query without `removedAt: null`. Partial-filter index no longer applies. Name the dropped key in the comment.
 - **🟠 high** — `.sort({ 'profile.name': 1 })` added to a query whose chosen index is `{ userId: 1, finishedAt: 1 }`. In-memory sort.
 - **🟠 high** — `Act2Runs.find({ status })` on a collection sharded by `userId`. Fans out to every shard.
+- **🟠 high** — `Act2Runs.find({ userId, actId, removedAt: null }, { sort: { startedAt: -1 }, limit: 100, skip: 100_000 })` on a sharded collection without `readConcern: 'available'`. The `SHARDING_FILTER` stage fetches and orphan-checks the 100k skipped docs (~60s observed). Pass `readConcern: 'available'`, or refactor to cursor pagination using `startedAt < lastStartedAt`.
 - **🟡 medium** — `$or` with one indexed branch and one unindexed branch — flag the unindexed branch.
 
 ## Severity classification
@@ -69,11 +72,13 @@ Apply this ESR-aware (Equality → Sort → Range) rubric. Pick the highest appl
   - **Partial filter expression** of the matching index is incompatible with the query (e.g. index has `partialFilterExpression: { removedAt: null }` but the query also wants `removedAt: { $exists: true }`).
   - **Sort can't use the index** (sort field absent from the index, or appears before equality fields).
   - Query uses **`$regex` without an anchor** (`/^…/`) on an indexed field — the regex still won't use the index without an anchor.
+  - **`.skip(N)` / `$skip` on a sharded collection without `readConcern: 'available'`** — `SHARDING_FILTER` orphan-check dominates query time for large skips.
 - 🟡 **medium** — An index exists and is used, but is **likely inefficient**:
   - Filter is on a **low-selectivity** field only (e.g. only on a boolean or status enum that covers most documents) and there's no further filter to narrow it.
   - **Read/return ratio likely poor**: index scans many docs the query then discards (e.g. range-then-equality compound order, or `$ne` / `$nin` on an indexed field).
   - Sort uses the index but the direction doesn't match (compound index would need a reverse scan).
   - Multiple `$or` branches and at least one has no usable index.
+  - Query on a collection with **many overlapping indexes** lacks `hint:` — planner spends tens of seconds choosing a plan.
 - 🟢 **low** — Stylistic / advisory:
   - Could tighten an existing `partialFilterExpression` to reduce index size.
   - Project to fewer fields to make this a covered query.

From 0ce7b14dc2571b7ee9c561b1c8cd1f63359b52dd Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Mon, 18 May 2026 13:59:05 +0000
Subject: [PATCH 2/6] review: address mvolfik feedback on sharded-collection
 patterns
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two prompt rewrites prompted by inline review on #17:

* **Pattern #3 (shard-key dropped).** Old wording read ambiguously as
  "flag when somebody drops the unique index" rather than "flag when a
  query omits the shard-key field". Rewritten to make the intent
  explicit (drop = field missing from filter / `$match`), and to name
  the canonical source for the shard key — the
  `ShardAwareCollection(<name>, { shardKey: '<key>', … })` constructor,
  which lives outside `$MONGO_INDEXES_DIR`. The `// For sharding …`
  index comment is now described as a signal rather than the source.
  Adds the explicit caveat that some sharded collections (e.g.
  `Datasets`, sharded on `_id`) won't have that signal at all and
  should be left un-flagged when the status is uncertain.

* **Pattern #8 (sharded skip / readConcern).** Rewritten from "match
  these specific calls" to "here's the underlying mechanism, here are
  known examples, reason about new cases too": sharded chunks can hold
  orphan docs → `SHARDING_FILTER` stage fetches every candidate to drop
  them → `readConcern: 'available'` bypasses it. The named examples
  (skip, countDocuments-auto-handled, wide aggregation ranges) now sit
  under that explanation rather than being the rule itself, which lets
  the model spot novel index-only-on-shard regressions instead of just
  the patterns we already enumerated. Cursor-pagination guidance now
  consistently uses the sort key (`startedAt < lastStartedAt`), matching
  the concrete example below.
---
 mongodb-query-index-check/prompts/review.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md
index 310080b..04fc83e 100644
--- a/mongodb-query-index-check/prompts/review.md
+++ b/mongodb-query-index-check/prompts/review.md
@@ -45,12 +45,17 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag
 
 1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` → `disabledAt: { $ne: null }`) makes the index unusable.
 2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort.
-3. **Shard-key prefix dropped.** Sharded collections in `$MONGO_INDEXES_DIR` are marked with `// For sharding ...` next to a unique index whose first field is the shard key (often `userId`). Dropping it fans out to every shard.
+3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for the shard key is the `ShardAwareCollection(<name>, { shardKey: '<key>', ... })` constructor in the collection wrapper, which isn't visible from `$MONGO_INDEXES_DIR`. The reliable signal you *do* have: a `// For sharding <Collection>` comment in the index file, usually next to a unique index whose first field is the shard key. Some sharded collections (e.g. `Datasets`, sharded on `_id`) won't have that signal — when in doubt about sharding status, prefer not to flag.
 4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index.
 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields.
 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index.
 7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index.
-8. **`.skip(N)` / `$skip: N` on a sharded collection without `readConcern: 'available'`.** Sharded reads run a `SHARDING_FILTER` stage that fetches every skipped document and checks it against the shard's chunk ranges to filter orphans. For large skips this dominates the query (a 100k skip on `Act2Runs` was measured at ~60s in production). Fix: pass `readConcern: 'available'` in the call options (or refactor to cursor pagination with `$gt: lastId`). Applies to `find().skip()`, `find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`. **Do not flag `countDocuments`** — it's auto-handled by `ShardAwareCollection`.
+8. **Sharded-collection operation that should pass `readConcern: 'available'`.** Sharded chunks can contain orphan documents (residual from chunk migrations). For operations that would normally be served from the index alone, MongoDB has to fetch every candidate document and run a `SHARDING_FILTER` stage to drop orphans — expensive at scale. Passing `readConcern: 'available'` bypasses that filter at the cost of possibly including a few orphans (fine for most read paths). Known examples:
+    - **`.skip(N)` / `$skip: N`** — must fetch every skipped doc for the orphan check; measured ~60s on `Act2Runs` at skip=100k. Suggest `readConcern: 'available'` or cursor pagination using the sort key (e.g. `startedAt < lastStartedAt` for a `sort: { startedAt: -1 }` cursor).
+    - **`.countDocuments()`** — same problem, but **don't flag**: `ShardAwareCollection` already wraps it with `readConcern: 'available'`.
+    - **`aggregate` stages that traverse a wide index range without narrowing down** (`$count`, `$group` over many keys) — likely same problem.
+
+    Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate.
 9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: '<index name>'` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟡 medium.
 
 ## Concrete examples

From 7479cd4ee1b874f59c082bab6e0792c513a63f71 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Marek=20Trunk=C3=A1t?= <marek@trunkat.eu>
Date: Tue, 19 May 2026 10:49:01 +0200
Subject: [PATCH 3/6] Claude/fix mongodb duplicates x v56n (#18)

Co-authored-by: Claude <noreply@anthropic.com>
---
 mongodb-query-index-check/prompts/review.md | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md
index 04fc83e..96f13d0 100644
--- a/mongodb-query-index-check/prompts/review.md
+++ b/mongodb-query-index-check/prompts/review.md
@@ -114,7 +114,18 @@ Guardrails:
 - Skip silently if the collection file doesn't exist in `$MONGO_INDEXES_DIR`, or you can't determine the collection reliably.
 - No findings → write `none` to `$RESULT_PATH` and exit silently.
 
-### 3. Post the review (only when there is at least one finding)
+### 3. Deduplicate against previous runs
+
+A previous run on this PR may have already posted some of these findings. Filter duplicates before posting:
+
+1. Call `mcp__github__pull_request_read` with `method: get_review_comments` and `pullNumber: $PR_NUMBER`. The response is a list of review threads; each has `is_outdated` and a `comments` array (every comment has `author`, `body`, `path`, `line`).
+2. Collect comments whose `author` is `github-actions`, whose containing thread is not `is_outdated`, and whose `body` starts with the template prefix defined in step 4 (any of the four severity emojis followed by ` **MongoDB index check`).
+3. Drop any finding whose `(path, line)` matches a collected comment. Outdated threads don't count — line numbers shift across commits, so a fresh finding on a shifted line is a new finding.
+4. If no new findings remain, compute the max severity from the original (un-filtered) findings list, write it to `$RESULT_PATH`, and exit silently. Do not submit an empty review.
+
+If `get_review_comments` errors, proceed without dedup — a duplicate review is better than no review.
+
+### 4. Post the review (only when at least one new finding remains)
 
 Use whichever GitHub MCP review tools are available. Common shapes:
 
@@ -141,9 +152,9 @@ Decide the review event:
 
 When submitting, include a brief summary body — at most 4 short bullets covering: total findings count, breakdown by severity, and any cross-cutting recommendation (e.g. "Consider adding a compound index on `{userId, actorTaskId}` for these three queries"). End the summary body with the line `cc @mtrunkat` so they get notified of every review with findings.
 
-### 4. Persist the result
+### 5. Persist the result
 
-After submitting the review (or after deciding no review is needed), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. **Use the `Write` tool** — bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail.
+After submitting the review (or after skipping submission per step 3), write the maximum severity to `$RESULT_PATH` as a single lowercase word with **no whitespace and no newline**. Examples: `none`, `low`, `medium`, `high`, `critical`. The severity must reflect **all findings you identified in step 2**, including ones dropped as duplicates — re-running the workflow must never silently green a previously-failing check. **Use the `Write` tool** — bash output redirection (`>`, `>>`) is blocked by the sandbox even for paths inside the workspace, so `printf > $RESULT_PATH` will fail.
 
 ## Bash sandbox notes
 

From e00ffe59eb341df80654f3ee5acf9458d627c9cb Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 19 May 2026 08:55:36 +0000
Subject: [PATCH 4/6] feat: flag missing projection + prefer reusing indexes
 over adding new ones
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two additions sourced from team feedback while reviewing this PR:

* **Missing projection pattern** (apify/apify-core#19762). Adds a new
  "Read without projection" entry to the patterns checklist, a 🟡
  medium severity bullet, and a Users.findOne example. Flags
  find/findOne/findOneAnd* reads that omit projection and waste IO +
  BSON parse time on multi-KB documents. Skips writes and aggregate
  (which shapes via $project).

* **Prefer reusing existing indexes.** From the Slack thread on this
  PR — busy collections already have dozens of indexes and the goal is
  to shrink that surface, not grow it. Rewrites the "Suggested action"
  template in step 4 with a 3-tier preference: (1) adjust the query to
  fit an existing index, (2) extend or replace an existing index, (3)
  only as a last resort recommend a new one, with explicit reasoning
  why (1) and (2) don't work.
---
 mongodb-query-index-check/prompts/review.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md
index 96f13d0..c8d12a1 100644
--- a/mongodb-query-index-check/prompts/review.md
+++ b/mongodb-query-index-check/prompts/review.md
@@ -57,6 +57,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag
 
     Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate.
 9. **Query on a heavy-index collection without `hint`.** If the collection file in `$MONGO_INDEXES_DIR` has many `ensureIndex` calls (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can spend tens of seconds evaluating candidate plans on cold caches. Recommend `hint: '<index name>'` to pin plan selection. The chosen index needs `name:` set; if it doesn't, recommend adding one first. 🟡 medium.
+10. **Read without `projection`.** `find()`, `findOne()`, and `findOneAnd*()` calls that omit a `projection` (either in the options object or via a chained `.project()`) return every field of every matched document — wasted IO, network bandwidth, and BSON parse time. Documents in `Users`, `Act2Runs`, etc. are easily 10–100 KB, and most callers only need a handful of fields. Flag any added/modified read that lacks a projection; suggest the smallest projection covering the fields actually consumed downstream (and just `{ _id: 1 }` for existence checks). Skip writes (`updateOne`, `deleteOne`, `bulkWrite`) and `aggregate` (which shapes via `$project` and is often a full transformation anyway). 🟡 medium.
 
 ## Concrete examples
 
@@ -66,6 +67,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag
 - **🟠 high** — `Act2Runs.find({ status })` on a collection sharded by `userId`. Fans out to every shard.
 - **🟠 high** — `Act2Runs.find({ userId, actId, removedAt: null }, { sort: { startedAt: -1 }, limit: 100, skip: 100_000 })` on a sharded collection without `readConcern: 'available'`. The `SHARDING_FILTER` stage fetches and orphan-checks the 100k skipped docs (~60s observed). Pass `readConcern: 'available'`, or refactor to cursor pagination using `startedAt < lastStartedAt`.
 - **🟡 medium** — `$or` with one indexed branch and one unindexed branch — flag the unindexed branch.
+- **🟡 medium** — `Users.findOne({ _id: id })` with no projection. Returns the full ~50 KB user document when the caller only reads `.username`. Suggest `{ projection: { username: 1 } }` (or `{ _id: 1 }` if it's an existence check).
 
 ## Severity classification
 
@@ -84,6 +86,7 @@ Apply this ESR-aware (Equality → Sort → Range) rubric. Pick the highest appl
   - Sort uses the index but the direction doesn't match (compound index would need a reverse scan).
   - Multiple `$or` branches and at least one has no usable index.
   - Query on a collection with **many overlapping indexes** lacks `hint:` — planner spends tens of seconds choosing a plan.
+  - Read returns the full document because `projection` was omitted — wastes IO and BSON parse time for collections with multi-KB documents.
 - 🟢 **low** — Stylistic / advisory:
   - Could tighten an existing `partialFilterExpression` to reduce index size.
   - Project to fewer fields to make this a covered query.
@@ -141,7 +144,7 @@ For each finding, point `path` and `line` at the **after-state RIGHT line** of t
 
     {{1–3 sentences describing the issue. Reference the specific filter fields and the specific existing index by its `fields + name` if it has a name. Be concrete.}}
 
-    **Suggested action**: {{Either point to an existing index that should be used and what to adjust in the query, or recommend adding a new index in `src/packages/mongo-indexes/src/<file>.ts`. Be specific about which fields and partial-filter expressions.}}
+    **Suggested action**: {{Prefer fixes that don't add new indexes — busy collections (`Users`, `Act2Runs`, …) already have dozens and the explicit goal is to shrink that surface, not grow it. In order of preference: (1) adjust the query to fit an existing index — add the missing prefix field, include the `partialFilterExpression` qualifier, drop the offending sort, switch to cursor pagination, add a `projection`, etc.; name the index by `name:` if it has one, else by key spec. (2) Extend or replace an existing index so it covers both this query and its current callers (e.g. add one field to a compound) — name the index to modify and the new shape. (3) Only as a last resort, recommend a new index in `src/packages/mongo-indexes/src/<file>.ts`, and explicitly say why (1) and (2) don't work.}}
 
 Where `{{SEVERITY_EMOJI}}` / `{{SEVERITY_WORD}}` is one of `🔴 critical`, `🟠 high`, `🟡 medium`, `🟢 low`.
 

From 2e0556165c06dd2dea88055d067958c5beff9c57 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 19 May 2026 09:16:59 +0000
Subject: [PATCH 5/6] review: fix ShardAwareCollection refs (mvolfik feedback)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* Real constructor signature: `new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys[], openTelemetry?)`,
  used like `new ShardAwareCollection<ActorRun, 'userId' | '_id'>(this.Act2Runs, ['userId', '_id'], openTelemetry)`.
  Old version was hallucinated.
* Extend sparse-checkout to also fetch `src/packages/mongo-connection/src/shard_aware_collection`
  so Claude can `Read` the file directly. New `$SHARD_AWARE_COLLECTION_DIR` env var surfaced
  in the prompt; empty when the file isn't present on the runner.
* Drop the "don't flag `countDocuments`" exception — `ShardAwareCollection` actually exposes
  `approximateCountDocuments` (which wraps `readConcern: 'available'`), not `countDocuments`.
  Plain `.countDocuments()` on a sharded raw collection is suspect and should be flagged.
* Step 1: loosen the "don't read outside `$MONGO_INDEXES_DIR`" rule to allow reading
  `$SHARD_AWARE_COLLECTION_DIR` and grepping the workspace for `new ShardAwareCollection<`.
---
 mongodb-query-index-check/action.yaml       | 20 ++++++++++++++++----
 mongodb-query-index-check/prompts/review.md |  8 +++++---
 2 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/mongodb-query-index-check/action.yaml b/mongodb-query-index-check/action.yaml
index ca11485..6a32ba8 100644
--- a/mongodb-query-index-check/action.yaml
+++ b/mongodb-query-index-check/action.yaml
@@ -107,7 +107,9 @@ runs:
         ref: develop
         token: ${{ inputs.apify-core-token }}
         path: __mongo_index_check_apify_core
-        sparse-checkout: src/packages/mongo-indexes/src
+        sparse-checkout: |
+          src/packages/mongo-indexes/src
+          src/packages/mongo-connection/src/shard_aware_collection
         sparse-checkout-cone-mode: false
         fetch-depth: 1
 
@@ -119,12 +121,14 @@ runs:
       run: |
         set -euo pipefail
         if [ -n "$APIFY_CORE_TOKEN" ]; then
-          indexes_dir="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core/src/packages/mongo-indexes/src"
+          repo_root="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core"
           origin_label="apify-core@develop"
         else
-          indexes_dir="${GITHUB_WORKSPACE}/src/packages/mongo-indexes/src"
+          repo_root="${GITHUB_WORKSPACE}"
           origin_label="local workspace (assuming caller is apify-core)"
         fi
+        indexes_dir="${repo_root}/src/packages/mongo-indexes/src"
+        shard_aware_dir="${repo_root}/src/packages/mongo-connection/src/shard_aware_collection"
         if [ ! -d "$indexes_dir" ]; then
           echo "::error::Could not find mongo-indexes source directory at: $indexes_dir"
           exit 1
@@ -132,6 +136,14 @@ runs:
         file_count=$(find "$indexes_dir" -maxdepth 2 -type f -name '*.ts' | wc -l)
         echo "Resolved mongo-indexes from ${origin_label}: ${indexes_dir} (${file_count} .ts file(s))."
         echo "MONGO_INDEXES_DIR=${indexes_dir}" >> "$GITHUB_ENV"
+        # Optional: surface the ShardAwareCollection source so Claude can verify shard keys directly.
+        # Skipped silently if absent (older apify-core ref, partial sparse checkout, etc.).
+        if [ -d "$shard_aware_dir" ]; then
+          echo "Resolved ShardAwareCollection sources at ${shard_aware_dir}."
+          echo "SHARD_AWARE_COLLECTION_DIR=${shard_aware_dir}" >> "$GITHUB_ENV"
+        else
+          echo "SHARD_AWARE_COLLECTION_DIR=" >> "$GITHUB_ENV"
+        fi
 
     - name: Render Claude prompt
       id: render
@@ -146,7 +158,7 @@ runs:
       run: |
         set -euo pipefail
         prompt_file="${RUNNER_TEMP}/mongo-index-prompt.md"
-        envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \
+        envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $SHARD_AWARE_COLLECTION_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \
           < "${GITHUB_ACTION_PATH}/prompts/review.md" \
           > "$prompt_file"
         delimiter="EOF_${RANDOM}${RANDOM}${RANDOM}"
diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md
index c8d12a1..8a240ff 100644
--- a/mongodb-query-index-check/prompts/review.md
+++ b/mongodb-query-index-check/prompts/review.md
@@ -16,6 +16,7 @@ You are NOT here to do a general code review. Stay focused on indexes.
   - Files inside it are TypeScript sources, one per collection (e.g. `users.ts`, `actor_jobs.ts`, `request_queues.ts`).
   - Each file calls `await ensureIndex(Collection, { fields... }, { options... })`. Read these files to learn what indexes exist for each collection, including their `partialFilterExpression`, `unique`, `sparse`, `collation`, and `name` options.
   - The `index.ts` and `ensure_index.ts` files are infrastructure, not index definitions — skip them.
+- **ShardAwareCollection sources** (canonical source for shard keys): `$SHARD_AWARE_COLLECTION_DIR` (may be empty if unavailable on this runner). When set, contains `shard_aware_collection.ts` and `filter_types.ts` from `apify/apify-core`'s `mongo-connection` package. Read the file when you need to confirm whether a collection is sharded and on which keys — `grep` for `new ShardAwareCollection<` in the apify-core source separately to find usages.
 - **Changed source files in this PR** (JSON array of paths, already filtered to source files outside test/build/vendor dirs): `$CHANGED_FILES_PATH`
 - **Request-changes mode**: `$REQUEST_CHANGES_MODE`. When `true`, the workflow will fail on any finding; when `false`, comments only.
 - **Result file** you must write the maximum severity to: `$RESULT_PATH`. Allowed contents: one of `none`, `low`, `medium`, `high`, `critical` — single word, no whitespace, no trailing newline.
@@ -45,14 +46,14 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag
 
 1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` → `disabledAt: { $ne: null }`) makes the index unusable.
 2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort.
-3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for the shard key is the `ShardAwareCollection(<name>, { shardKey: '<key>', ... })` constructor in the collection wrapper, which isn't visible from `$MONGO_INDEXES_DIR`. The reliable signal you *do* have: a `// For sharding <Collection>` comment in the index file, usually next to a unique index whose first field is the shard key. Some sharded collections (e.g. `Datasets`, sharded on `_id`) won't have that signal — when in doubt about sharding status, prefer not to flag.
+3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class — its constructor signature is `new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys, openTelemetry?)`, called like `new ShardAwareCollection<ActorRun, 'userId' | '_id'>(this.Act2Runs, ['userId', '_id'], openTelemetry)` (see `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` when that env var is set). Grep `new ShardAwareCollection<` across the apify-core source to find usages and read off the `shardKeys` array; the secondary signal in `$MONGO_INDEXES_DIR` is a `// For sharding <Collection>` comment next to a unique index. When you can't confirm sharding from either source, prefer not to flag.
 4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index.
 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields.
 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index.
 7. **Range filter before equality in a compound query.** ESR order: equality, sort, range. Reversing neutralises the index.
 8. **Sharded-collection operation that should pass `readConcern: 'available'`.** Sharded chunks can contain orphan documents (residual from chunk migrations). For operations that would normally be served from the index alone, MongoDB has to fetch every candidate document and run a `SHARDING_FILTER` stage to drop orphans — expensive at scale. Passing `readConcern: 'available'` bypasses that filter at the cost of possibly including a few orphans (fine for most read paths). Known examples:
     - **`.skip(N)` / `$skip: N`** — must fetch every skipped doc for the orphan check; measured ~60s on `Act2Runs` at skip=100k. Suggest `readConcern: 'available'` or cursor pagination using the sort key (e.g. `startedAt < lastStartedAt` for a `sort: { startedAt: -1 }` cursor).
-    - **`.countDocuments()`** — same problem, but **don't flag**: `ShardAwareCollection` already wraps it with `readConcern: 'available'`.
+    - **Count operations** — `ShardAwareCollection.approximateCountDocuments()` wraps `readConcern: 'available'` already, so **don't flag it**. Plain `.countDocuments()` on the underlying `Collection` (or via `.rawCollection`) on a sharded collection does *not* — flag those.
     - **`aggregate` stages that traverse a wide index range without narrowing down** (`$count`, `$group` over many keys) — likely same problem.
 
     Reason from the underlying principle for novel cases too: any operation that would be index-only on a non-sharded collection but suddenly needs document fetches on a sharded one is a candidate.
@@ -100,7 +101,8 @@ Follow these steps in order:
 
 - Read `$CHANGED_FILES_PATH` (a JSON array of file paths). Use `Read` or `cat`.
 - For each file, fetch its diff. Prefer the GitHub MCP server (e.g. `mcp__github__pull_request_read` with `get_diff` / `get_files`, passing owner+repo from `$REPO` and `pull_number: $PR_NUMBER`). If that tool isn't available, fall back to `gh pr diff $PR_NUMBER --repo $REPO -- <path>` via Bash. Focus on diff hunks that contain at least one added line — in those hunks, the after-state (added lines plus their unchanged context) is the surface to analyze; ignore hunks that contain only deletions.
-- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/<collection-snake-case>.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find` on `$MONGO_INDEXES_DIR/*`. Don't read anything outside `$MONGO_INDEXES_DIR`.
+- For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/<collection-snake-case>.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find`.
+- When you need to confirm whether a collection is sharded (or on which keys), `Read` `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` if `$SHARD_AWARE_COLLECTION_DIR` is non-empty, and grep the workspace for `new ShardAwareCollection<` to find collection-specific instantiations.
 
 ### 2. Decide findings
 

From e7b8663e47e693c9d176147aea08d79e8f157d31 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Tue, 19 May 2026 09:21:58 +0000
Subject: [PATCH 6/6] review: drop the shard_aware_collection sparse-checkout
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Prior commit extended the sparse-checkout to pull
`mongo-connection/src/shard_aware_collection`. That doesn't actually help
— the file just defines the class. The *instantiations*
(`new ShardAwareCollection<ActorRun, 'userId' | '_id'>(...)`) live at
call sites scattered across apify-core, which Claude needs to grep for
anyway.

Revert the action.yaml additions (sparse-checkout entry, env var,
envsubst placeholder) and the prompt's Input-data bullet. The Pattern #3
description keeps the constructor signature inline (it's stable enough
to bake into the prompt) and step 1 still tells Claude to grep the
workspace for `new ShardAwareCollection<` to find shard keys.
---
 mongodb-query-index-check/action.yaml       | 20 ++++----------------
 mongodb-query-index-check/prompts/review.md |  5 ++---
 2 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/mongodb-query-index-check/action.yaml b/mongodb-query-index-check/action.yaml
index 6a32ba8..ca11485 100644
--- a/mongodb-query-index-check/action.yaml
+++ b/mongodb-query-index-check/action.yaml
@@ -107,9 +107,7 @@ runs:
         ref: develop
         token: ${{ inputs.apify-core-token }}
         path: __mongo_index_check_apify_core
-        sparse-checkout: |
-          src/packages/mongo-indexes/src
-          src/packages/mongo-connection/src/shard_aware_collection
+        sparse-checkout: src/packages/mongo-indexes/src
         sparse-checkout-cone-mode: false
         fetch-depth: 1
 
@@ -121,14 +119,12 @@ runs:
       run: |
         set -euo pipefail
         if [ -n "$APIFY_CORE_TOKEN" ]; then
-          repo_root="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core"
+          indexes_dir="${GITHUB_WORKSPACE}/__mongo_index_check_apify_core/src/packages/mongo-indexes/src"
           origin_label="apify-core@develop"
         else
-          repo_root="${GITHUB_WORKSPACE}"
+          indexes_dir="${GITHUB_WORKSPACE}/src/packages/mongo-indexes/src"
           origin_label="local workspace (assuming caller is apify-core)"
         fi
-        indexes_dir="${repo_root}/src/packages/mongo-indexes/src"
-        shard_aware_dir="${repo_root}/src/packages/mongo-connection/src/shard_aware_collection"
         if [ ! -d "$indexes_dir" ]; then
           echo "::error::Could not find mongo-indexes source directory at: $indexes_dir"
           exit 1
@@ -136,14 +132,6 @@ runs:
         file_count=$(find "$indexes_dir" -maxdepth 2 -type f -name '*.ts' | wc -l)
         echo "Resolved mongo-indexes from ${origin_label}: ${indexes_dir} (${file_count} .ts file(s))."
         echo "MONGO_INDEXES_DIR=${indexes_dir}" >> "$GITHUB_ENV"
-        # Optional: surface the ShardAwareCollection source so Claude can verify shard keys directly.
-        # Skipped silently if absent (older apify-core ref, partial sparse checkout, etc.).
-        if [ -d "$shard_aware_dir" ]; then
-          echo "Resolved ShardAwareCollection sources at ${shard_aware_dir}."
-          echo "SHARD_AWARE_COLLECTION_DIR=${shard_aware_dir}" >> "$GITHUB_ENV"
-        else
-          echo "SHARD_AWARE_COLLECTION_DIR=" >> "$GITHUB_ENV"
-        fi
 
     - name: Render Claude prompt
       id: render
@@ -158,7 +146,7 @@ runs:
       run: |
         set -euo pipefail
         prompt_file="${RUNNER_TEMP}/mongo-index-prompt.md"
-        envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $SHARD_AWARE_COLLECTION_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \
+        envsubst '$PR_NUMBER $REPO $BASE_SHA $HEAD_SHA $MONGO_INDEXES_DIR $CHANGED_FILES_PATH $RESULT_PATH $REQUEST_CHANGES_MODE' \
           < "${GITHUB_ACTION_PATH}/prompts/review.md" \
           > "$prompt_file"
         delimiter="EOF_${RANDOM}${RANDOM}${RANDOM}"
diff --git a/mongodb-query-index-check/prompts/review.md b/mongodb-query-index-check/prompts/review.md
index 8a240ff..3fbfa31 100644
--- a/mongodb-query-index-check/prompts/review.md
+++ b/mongodb-query-index-check/prompts/review.md
@@ -16,7 +16,6 @@ You are NOT here to do a general code review. Stay focused on indexes.
   - Files inside it are TypeScript sources, one per collection (e.g. `users.ts`, `actor_jobs.ts`, `request_queues.ts`).
   - Each file calls `await ensureIndex(Collection, { fields... }, { options... })`. Read these files to learn what indexes exist for each collection, including their `partialFilterExpression`, `unique`, `sparse`, `collation`, and `name` options.
   - The `index.ts` and `ensure_index.ts` files are infrastructure, not index definitions — skip them.
-- **ShardAwareCollection sources** (canonical source for shard keys): `$SHARD_AWARE_COLLECTION_DIR` (may be empty if unavailable on this runner). When set, contains `shard_aware_collection.ts` and `filter_types.ts` from `apify/apify-core`'s `mongo-connection` package. Read the file when you need to confirm whether a collection is sharded and on which keys — `grep` for `new ShardAwareCollection<` in the apify-core source separately to find usages.
 - **Changed source files in this PR** (JSON array of paths, already filtered to source files outside test/build/vendor dirs): `$CHANGED_FILES_PATH`
 - **Request-changes mode**: `$REQUEST_CHANGES_MODE`. When `true`, the workflow will fail on any finding; when `false`, comments only.
 - **Result file** you must write the maximum severity to: `$RESULT_PATH`. Allowed contents: one of `none`, `low`, `medium`, `high`, `critical` — single word, no whitespace, no trailing newline.
@@ -46,7 +45,7 @@ Reconstruct the before- and after-state of each MongoDB call and compare both ag
 
 1. **Partial-filter qualifier dropped or rewritten.** If a candidate index has `partialFilterExpression: { removedAt: null }`, the after-query must include every key. Removing or rewriting it (e.g. `removedAt: null` → `disabledAt: { $ne: null }`) makes the index unusable.
 2. **New `sort` outside the chosen index's ESR position.** Sort field must follow all equality fields in the index, else in-memory sort.
-3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class — its constructor signature is `new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys, openTelemetry?)`, called like `new ShardAwareCollection<ActorRun, 'userId' | '_id'>(this.Act2Runs, ['userId', '_id'], openTelemetry)` (see `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` when that env var is set). Grep `new ShardAwareCollection<` across the apify-core source to find usages and read off the `shardKeys` array; the secondary signal in `$MONGO_INDEXES_DIR` is a `// For sharding <Collection>` comment next to a unique index. When you can't confirm sharding from either source, prefer not to flag.
+3. **Sharded collection query omits the shard key from its filter / `$match`.** Sharded queries should include the shard key (in the `find()` filter, the filter argument of `findOneAnd*` / `updateOne` / `deleteOne` / etc., or the first `$match` of an `aggregate`) — without it, MongoDB fans the query out to every shard. The canonical source for shard keys is the `ShardAwareCollection` class in `apify/apify-core`'s `mongo-connection` package — its constructor takes the shard keys as the second positional argument: `new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys, openTelemetry?)`, used like `new ShardAwareCollection<ActorRun, 'userId' | '_id'>(this.Act2Runs, ['userId', '_id'], openTelemetry)` — meaning `Act2Runs` is sharded on `[userId, _id]`. To find shard keys for a collection, grep the workspace for `new ShardAwareCollection<` and read off the array literal at the matching usage. Secondary signal: a `// For sharding <Collection>` comment in the matching `$MONGO_INDEXES_DIR/<file>.ts`. When neither source confirms sharding, prefer not to flag.
 4. **New or modified `$or` branch.** Each branch is matched independently. Flag any branch without an index.
 5. **Equality → negation** (`$ne`, `$nin`, `$not`, `$exists: false`). Forces a scan even on indexed fields.
 6. **Unanchored `$regex` on an indexed field.** Only `/^prefix/` uses the index.
@@ -102,7 +101,7 @@ Follow these steps in order:
 - Read `$CHANGED_FILES_PATH` (a JSON array of file paths). Use `Read` or `cat`.
 - For each file, fetch its diff. Prefer the GitHub MCP server (e.g. `mcp__github__pull_request_read` with `get_diff` / `get_files`, passing owner+repo from `$REPO` and `pull_number: $PR_NUMBER`). If that tool isn't available, fall back to `gh pr diff $PR_NUMBER --repo $REPO -- <path>` via Bash. Focus on diff hunks that contain at least one added line — in those hunks, the after-state (added lines plus their unchanged context) is the surface to analyze; ignore hunks that contain only deletions.
 - For each MongoDB call you find, identify the collection and read `$MONGO_INDEXES_DIR/<collection-snake-case>.ts` to learn the available indexes. Use `Read` (preferred) or `cat` / `grep` / `find`.
-- When you need to confirm whether a collection is sharded (or on which keys), `Read` `$SHARD_AWARE_COLLECTION_DIR/shard_aware_collection.ts` if `$SHARD_AWARE_COLLECTION_DIR` is non-empty, and grep the workspace for `new ShardAwareCollection<` to find collection-specific instantiations.
+- When you need to confirm whether a collection is sharded (or on which keys), grep the workspace for `new ShardAwareCollection<` and read off the shard-key array from the matching usage.
 
 ### 2. Decide findings