Skip to content

feat: Update of MongoDB index checker, prompt, deduplication, etc.#17

Merged
mtrunkat merged 6 commits into
mainfrom
claude/mongodb-query-index-check-sharding-patterns
May 19, 2026
Merged

feat: Update of MongoDB index checker, prompt, deduplication, etc.#17
mtrunkat merged 6 commits into
mainfrom
claude/mongodb-query-index-check-sharding-patterns

Conversation

@mtrunkat
Copy link
Copy Markdown
Member

@mtrunkat mtrunkat commented May 18, 2026

What changed

Extensions to prompts/review.md for mongodb-query-index-check, sourced from apify-core post-mortems and team feedback during review:

  • Sharded patterns. Pattern feat: add mongodb-query-index-check action #3 now bakes in the real ShardAwareCollection constructor signature (new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys[], openTelemetry?)) and tells Claude to grep the workspace for new ShardAwareCollection< to find shard keys per collection. Mechanism-first treatment of readConcern: 'available' so Claude can reason about novel cases. ShardAwareCollection exposes approximateCountDocuments (wrapped) — not countDocuments, so plain .countDocuments() on a sharded raw collection IS flag-worthy.
  • Heavy-index hint (🟡 medium). Flag queries on 40+-index collections lacking hint:.
  • Missing projection (🟡 medium). Closes apify-core#19762. Flags reads without projection (wastes IO + BSON parse on multi-KB docs).
  • Prefer reusing existing indexes. Rewritten "Suggested action": (1) adjust the query, (2) extend an existing index, (3) only as last resort propose a new one. Per the Slack thread — busy collections already have too many indexes and the goal is to shrink that surface, not grow it.
  • Cross-run dedup (already merged via Claude/fix mongodb duplicates x v56n #18) — Claude drops findings at the same (path, line) as prior bot comments before posting.

Tests

pnpm run lint / type-check / test — clean, 35/35. envsubst allowlist unchanged.

Two new patterns added to the action's prompt, sourced from recent
internal post-mortems on apify-core slow queries:

1. **`.skip(N)` / `$skip` on a sharded collection without
   `readConcern: 'available'`.** MongoDB runs a `SHARDING_FILTER` stage
   that fetches every skipped document and checks it against the
   shard's chunk ranges to filter orphans. For large skips this
   dominates the query — measured ~60s on `Act2Runs` with skip=100k.
   Fix: pass `readConcern: 'available'` (or refactor to cursor
   pagination with `$gt: lastId`). Applies to `find().skip()`,
   `find(..., { skip: N })`, and `aggregate([{ $skip: N }, …])`.
   `countDocuments` is auto-handled by `ShardAwareCollection` and
   should not be flagged.

2. **Query on a heavy-index collection without `hint`.** When the
   collection's `mongo-indexes` file has many `ensureIndex` calls
   (e.g. `actor_jobs.ts` for `Act2Runs` has 40+), the planner can
   spend tens of seconds choosing a plan. Recommend
   `hint: '<index name>'` to pin selection. 🟡 medium.

Includes one concrete example for #1 (the 100k-skip on Act2Runs case
from the post-mortem). The severity rubric gains corresponding entries
under 🟠 high (sharded-skip) and 🟡 medium (missing hint).
Copilot AI review requested due to automatic review settings May 18, 2026 13:18
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the MongoDB query/index review prompt used by mongodb-query-index-check to catch two additional slow-query patterns seen in post-mortems, improving the action’s ability to flag sharding- and planner-related regressions.

Changes:

  • Add a new “sharded skip without readConcern: 'available'” pattern plus a concrete high-severity example.
  • Add a new “heavy-index collection without hint:” pattern and corresponding severity rubric guidance.
  • Extend the severity rubric to classify both patterns (high for sharded skip, medium for missing hint on many-index collections).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mongodb-query-index-check/prompts/review.md Outdated
Comment thread mongodb-query-index-check/prompts/review.md Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this instruction is quite unclear tbh.. At first, I read it as "flag when somebody tries to remove the unique index equivalent to the shard key" - "dropping it fans out to every shard" sounds like dropping the index. Imo the instruction should be more about "queries to sharded collections should include the shard key (in filter for find() and similar operations, or in $match stage in aggregations) whenever possible, to avoid sending queries to all shards", or something in that sense

Also, the canonical info about a shard key is the parameter to ShardAwareCollection constructor, the index is more of a side effect (and e.g. for datasets, I don't think we'll have an explicit index like this, because we'll shard just by _id).

Comment thread mongodb-query-index-check/prompts/review.md Outdated
Two prompt rewrites prompted by inline review on #17:

* **Pattern #3 (shard-key dropped).** Old wording read ambiguously as
  "flag when somebody drops the unique index" rather than "flag when a
  query omits the shard-key field". Rewritten to make the intent
  explicit (drop = field missing from filter / `$match`), and to name
  the canonical source for the shard key — the
  `ShardAwareCollection(<name>, { shardKey: '<key>', … })` constructor,
  which lives outside `$MONGO_INDEXES_DIR`. The `// For sharding …`
  index comment is now described as a signal rather than the source.
  Adds the explicit caveat that some sharded collections (e.g.
  `Datasets`, sharded on `_id`) won't have that signal at all and
  should be left un-flagged when the status is uncertain.

* **Pattern #8 (sharded skip / readConcern).** Rewritten from "match
  these specific calls" to "here's the underlying mechanism, here are
  known examples, reason about new cases too": sharded chunks can hold
  orphan docs → `SHARDING_FILTER` stage fetches every candidate to drop
  them → `readConcern: 'available'` bypasses it. The named examples
  (skip, countDocuments-auto-handled, wide aggregation ranges) now sit
  under that explanation rather than being the rule itself, which lets
  the model spot novel index-only-on-shard regressions instead of just
  the patterns we already enumerated. Cursor-pagination guidance now
  consistently uses the sort key (`startedAt < lastStartedAt`), matching
  the concrete example below.
Comment thread mongodb-query-index-check/prompts/review.md Outdated
mtrunkat and others added 2 commits May 19, 2026 10:49
Co-authored-by: Claude <noreply@anthropic.com>
…w ones

Two additions sourced from team feedback while reviewing this PR:

* **Missing projection pattern** (apify/apify-core#19762). Adds a new
  "Read without projection" entry to the patterns checklist, a 🟡
  medium severity bullet, and a Users.findOne example. Flags
  find/findOne/findOneAnd* reads that omit projection and waste IO +
  BSON parse time on multi-KB documents. Skips writes and aggregate
  (which shapes via $project).

* **Prefer reusing existing indexes.** From the Slack thread on this
  PR — busy collections already have dozens of indexes and the goal is
  to shrink that surface, not grow it. Rewrites the "Suggested action"
  template in step 4 with a 3-tier preference: (1) adjust the query to
  fit an existing index, (2) extend or replace an existing index, (3)
  only as a last resort recommend a new one, with explicit reasoning
  why (1) and (2) don't work.
@mtrunkat mtrunkat changed the title feat: teach the prompt to flag sharded-skip and missing-hint queries feat: Update of MongoDB index checker, prompt, deduplication, etc. May 19, 2026
claude added 2 commits May 19, 2026 09:16
* Real constructor signature: `new ShardAwareCollection<TSchema, TShardKeys>(collection, shardKeys[], openTelemetry?)`,
  used like `new ShardAwareCollection<ActorRun, 'userId' | '_id'>(this.Act2Runs, ['userId', '_id'], openTelemetry)`.
  Old version was hallucinated.
* Extend sparse-checkout to also fetch `src/packages/mongo-connection/src/shard_aware_collection`
  so Claude can `Read` the file directly. New `$SHARD_AWARE_COLLECTION_DIR` env var surfaced
  in the prompt; empty when the file isn't present on the runner.
* Drop the "don't flag `countDocuments`" exception — `ShardAwareCollection` actually exposes
  `approximateCountDocuments` (which wraps `readConcern: 'available'`), not `countDocuments`.
  Plain `.countDocuments()` on a sharded raw collection is suspect and should be flagged.
* Step 1: loosen the "don't read outside `$MONGO_INDEXES_DIR`" rule to allow reading
  `$SHARD_AWARE_COLLECTION_DIR` and grepping the workspace for `new ShardAwareCollection<`.
Prior commit extended the sparse-checkout to pull
`mongo-connection/src/shard_aware_collection`. That doesn't actually help
— the file just defines the class. The *instantiations*
(`new ShardAwareCollection<ActorRun, 'userId' | '_id'>(...)`) live at
call sites scattered across apify-core, which Claude needs to grep for
anyway.

Revert the action.yaml additions (sparse-checkout entry, env var,
envsubst placeholder) and the prompt's Input-data bullet. The Pattern #3
description keeps the constructor signature inline (it's stable enough
to bake into the prompt) and step 1 still tells Claude to grep the
workspace for `new ShardAwareCollection<` to find shard keys.
@mtrunkat mtrunkat merged commit 86b91e4 into main May 19, 2026
3 checks passed
@mtrunkat mtrunkat deleted the claude/mongodb-query-index-check-sharding-patterns branch May 19, 2026 09:30
fnesveda pushed a commit that referenced this pull request May 19, 2026
🤖 I have created a release *beep* *boop*
---


## [1.2.0](v1.1.2...v1.2.0)
(2026-05-19)


### Features

* Update of MongoDB index checker, prompt, deduplication, etc.
([#17](#17))
([86b91e4](86b91e4))


### Bug Fixes

* Update `signed-commit` action to fix issues with unstaged or
conflicted files ([#22](#22))
([5e8686c](5e8686c))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants