feat: Update of MongoDB index checker, prompt, deduplication, etc.#17
Conversation
Two new patterns added to the action's prompt, sourced from recent
internal post-mortems on apify-core slow queries:
1. **`.skip(N)` / `$skip` on a sharded collection without
`readConcern: 'available'`.** MongoDB runs a `SHARDING_FILTER` stage
that fetches every skipped document and checks it against the
shard's chunk ranges to filter orphans. For large skips this
dominates the query — measured ~60s on `Act2Runs` with skip=100k.
Fix: pass `readConcern: 'available'` (or refactor to cursor
pagination with `$gt: lastId`). Applies to `find().skip()`,
`find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`.
`countDocuments` is auto-handled by `ShardAwareCollection` and
should not be flagged.
2. **Query on a heavy-index collection without `hint`.** When the
collection's `mongo-indexes` file has many `ensureIndex` calls
(e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can
spend tens of seconds choosing a plan. Recommend
`hint: '<index name>'` to pin selection. 🟡 medium.
Includes one concrete example for #1 (the 100k-skip on Act2Runs case
from the post-mortem). The severity rubric gains corresponding entries
under 🟠 high (sharded-skip) and 🟡 medium (missing hint).
There was a problem hiding this comment.
Pull request overview
Updates the MongoDB query/index review prompt used by mongodb-query-index-check to catch two additional slow-query patterns seen in post-mortems, improving the action’s ability to flag sharding- and planner-related regressions.
Changes:
- Add a new “sharded skip without
readConcern: 'available'” pattern plus a concrete high-severity example. - Add a new “heavy-index collection without
hint:” pattern and corresponding severity rubric guidance. - Extend the severity rubric to classify both patterns (high for sharded skip, medium for missing hint on many-index collections).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
this instruction is quite unclear tbh.. At first, I read it as "flag when somebody tries to remove the unique index equivalent to the shard key" - "dropping it fans out to every shard" sounds like dropping the index. Imo the instruction should be more about "queries to sharded collections should include the shard key (in filter for find() and similar operations, or in $match stage in aggregations) whenever possible, to avoid sending queries to all shards", or something in that sense
Also, the canonical info about a shard key is the parameter to ShardAwareCollection constructor, the index is more of a side effect (and e.g. for datasets, I don't think we'll have an explicit index like this, because we'll shard just by _id).
Two prompt rewrites prompted by inline review on #17: * **Pattern #3 (shard-key dropped).** Old wording read ambiguously as "flag when somebody drops the unique index" rather than "flag when a query omits the shard-key field". Rewritten to make the intent explicit (drop = field missing from filter / `$match`), and to name the canonical source for the shard key — the `ShardAwareCollection(<name>, { shardKey: '<key>', … })` constructor, which lives outside `$MONGO_INDEXES_DIR`. The `// For sharding …` index comment is now described as a signal rather than the source. Adds the explicit caveat that some sharded collections (e.g. `Datasets`, sharded on `_id`) won't have that signal at all and should be left un-flagged when the status is uncertain. * **Pattern #8 (sharded skip / readConcern).** Rewritten from "match these specific calls" to "here's the underlying mechanism, here are known examples, reason about new cases too": sharded chunks can hold orphan docs → `SHARDING_FILTER` stage fetches every candidate to drop them → `readConcern: 'available'` bypasses it. The named examples (skip, countDocuments-auto-handled, wide aggregation ranges) now sit under that explanation rather than being the rule itself, which lets the model spot novel index-only-on-shard regressions instead of just the patterns we already enumerated. Cursor-pagination guidance now consistently uses the sort key (`startedAt < lastStartedAt`), matching the concrete example below.
Co-authored-by: Claude <noreply@anthropic.com>
…w ones Two additions sourced from team feedback while reviewing this PR: * **Missing projection pattern** (apify/apify-core#19762). Adds a new "Read without projection" entry to the patterns checklist, a 🟡 medium severity bullet, and a Users.findOne example. Flags find/findOne/findOneAnd* reads that omit projection and waste IO + BSON parse time on multi-KB documents. Skips writes and aggregate (which shapes via $project). * **Prefer reusing existing indexes.** From the Slack thread on this PR — busy collections already have dozens of indexes and the goal is to shrink that surface, not grow it. Rewrites the "Suggested action" template in step 4 with a 3-tier preference: (1) adjust the query to fit an existing index, (2) extend or replace an existing index, (3) only as a last resort recommend a new one, with explicit reasoning why (1) and (2) don't work.
* Real constructor signature: `new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys[], openTelemetry?)`, used like `new ShardAwareCollection<ActorRun, 'userId' | '_id'>(this.Act2Runs, ['userId', '_id'], openTelemetry)`. Old version was hallucinated. * Extend sparse-checkout to also fetch `src/packages/mongo-connection/src/shard_aware_collection` so Claude can `Read` the file directly. New `$SHARD_AWARE_COLLECTION_DIR` env var surfaced in the prompt; empty when the file isn't present on the runner. * Drop the "don't flag `countDocuments`" exception — `ShardAwareCollection` actually exposes `approximateCountDocuments` (which wraps `readConcern: 'available'`), not `countDocuments`. Plain `.countDocuments()` on a sharded raw collection is suspect and should be flagged. * Step 1: loosen the "don't read outside `$MONGO_INDEXES_DIR`" rule to allow reading `$SHARD_AWARE_COLLECTION_DIR` and grepping the workspace for `new ShardAwareCollection<`.
Prior commit extended the sparse-checkout to pull `mongo-connection/src/shard_aware_collection`. That doesn't actually help — the file just defines the class. The *instantiations* (`new ShardAwareCollection<ActorRun, 'userId' | '_id'>(...)`) live at call sites scattered across apify-core, which Claude needs to grep for anyway. Revert the action.yaml additions (sparse-checkout entry, env var, envsubst placeholder) and the prompt's Input-data bullet. The Pattern #3 description keeps the constructor signature inline (it's stable enough to bake into the prompt) and step 1 still tells Claude to grep the workspace for `new ShardAwareCollection<` to find shard keys.
🤖 I have created a release *beep* *boop* --- ## [1.2.0](v1.1.2...v1.2.0) (2026-05-19) ### Features * Update of MongoDB index checker, prompt, deduplication, etc. ([#17](#17)) ([86b91e4](86b91e4)) ### Bug Fixes * Update `signed-commit` action to fix issues with unstaged or conflicted files ([#22](#22)) ([5e8686c](5e8686c)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
What changed
Extensions to
prompts/review.mdformongodb-query-index-check, sourced from apify-core post-mortems and team feedback during review:ShardAwareCollectionconstructor signature (new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys[], openTelemetry?)) and tells Claude to grep the workspace fornew ShardAwareCollection<to find shard keys per collection. Mechanism-first treatment ofreadConcern: 'available'so Claude can reason about novel cases.ShardAwareCollectionexposesapproximateCountDocuments(wrapped) — notcountDocuments, so plain.countDocuments()on a sharded raw collection IS flag-worthy.hint:.projection(wastes IO + BSON parse on multi-KB docs).(path, line)as prior bot comments before posting.Tests
pnpm run lint/type-check/test— clean, 35/35. envsubst allowlist unchanged.