
perf(retrieval): memoize per-entry tokens on the entry itself #9

Merged
silversurfer562 merged 1 commit into main from perf/precompute-entry-tokens on May 8, 2026

@silversurfer562
Member

Code-review finding (retrieval.py:179): the keyword retriever re-tokenized every entry's path, summary, content preview, and aliases on every query, and also looped over related entries' summaries. For N entries and Q queries, that is N×Q tokenization passes over identical text.

Add a _tokens_cache sidecar field on RetrievalEntry (a frozen dataclass; the field is excluded from comparison, hashing, and repr) and route the scorer through two memoizing helpers. The field-token slot is keyed by CONTENT_PREVIEW_CHARS and the related-token slot by corpus.name, so subclasses with different preview sizes and cross-corpus lookups don't collide.

Three new tests. 306 passed (was 303).

🤖 Generated with Claude Code

Code-review 2026-05-07: the keyword retriever's ``_score_entry``
re-tokenized path / summary / content-preview / aliases — and looped
over related entries' summaries — on EVERY query against EVERY entry.
For a corpus of N entries answering Q queries, that's N*Q tokenization
passes, all redoing identical work.

Add a ``_tokens_cache`` field on :class:`RetrievalEntry` (frozen
dataclass; ``compare=False, hash=False, repr=False`` so identity
semantics are unchanged) and route the retriever through two
helpers:

- ``_entry_field_tokens`` — keyed by ``("field_tokens", CONTENT_PREVIEW_CHARS)``.
  Computes path/summary/content-preview/aliases once per entry per
  retriever-class. Subclasses with different preview sizes get
  independent cache slots automatically.
- ``_related_summary_tokens`` — keyed by ``("related_tokens", corpus.name)``.
  Computes the union of related-entry summary tokens once per
  (entry, corpus) pair. Cache lives on the entry, so when a corpus
  rebuilds (new entry instances), it's naturally fresh.
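
The two helpers described above can be sketched roughly as follows. This is an illustrative sketch, not the code in retrieval.py: the ``tokenize`` function, the ``related`` field, the ``Corpus`` class, the un-prefixed helper names, and the value of ``CONTENT_PREVIEW_CHARS`` are all assumptions made so the example is self-contained.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

CONTENT_PREVIEW_CHARS = 200  # assumed value; the real constant is not shown in this PR


def tokenize(text: str) -> frozenset[str]:
    """Hypothetical tokenizer: lowercase alphanumeric runs."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass(frozen=True)
class RetrievalEntry:
    path: str
    summary: str
    content: str = ""
    aliases: tuple[str, ...] = ()
    related: tuple[str, ...] = ()  # hypothetical: paths of related entries
    # Sidecar cache. frozen=True only blocks attribute assignment, so the dict
    # itself stays mutable; compare/hash/repr=False keep it out of ==, hash()
    # and repr(), so entry identity semantics are unchanged.
    _tokens_cache: dict = field(
        default_factory=dict, compare=False, hash=False, repr=False
    )


@dataclass
class Corpus:  # hypothetical container, only for this sketch
    name: str
    entries: dict[str, RetrievalEntry]


def entry_field_tokens(entry, preview_chars=CONTENT_PREVIEW_CHARS):
    """Tokenize an entry's own fields once per (entry, preview size)."""
    key = ("field_tokens", preview_chars)
    cached = entry._tokens_cache.get(key)
    if cached is None:
        cached = {
            "path": tokenize(entry.path),
            "summary": tokenize(entry.summary),
            "content": tokenize(entry.content[:preview_chars]),
            "aliases": tokenize(" ".join(entry.aliases)),
        }
        entry._tokens_cache[key] = cached
    return cached


def related_summary_tokens(entry, corpus):
    """Union of related entries' summary tokens, once per (entry, corpus)."""
    key = ("related_tokens", corpus.name)
    cached = entry._tokens_cache.get(key)
    if cached is None:
        tokens = set()
        for path in entry.related:
            other = corpus.entries.get(path)
            if other is not None:
                tokens |= tokenize(other.summary)
        cached = frozenset(tokens)
        entry._tokens_cache[key] = cached
    return cached
```

Because the cache dict hangs off the entry instance, nothing needs explicit invalidation in this sketch: a rebuilt corpus creates new entries, each with an empty cache.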

Three new tests cover: cache populated on first call (same dict
returned on subsequent calls), preview-size-keyed independence, and
the bottom line — repeated ``_score_entry`` calls against the same
entry don't re-tokenize the entry's fields.
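
A test of the last kind might look roughly like the sketch below. It is not one of the three tests added here: ``Entry``, ``counting_tokenize``, ``field_tokens``, ``score_entry``, and the scoring weights are hypothetical stand-ins, chosen only to show how counting tokenizer calls can demonstrate that repeated scoring reuses the cache.

```python
from dataclasses import dataclass, field

CALLS = {"n": 0}  # number of times an entry field has been tokenized


def counting_tokenize(text):
    """Hypothetical whitespace tokenizer that counts its invocations."""
    CALLS["n"] += 1
    return frozenset(text.lower().split())


@dataclass(frozen=True)
class Entry:  # minimal stand-in for RetrievalEntry
    path: str
    summary: str
    _tokens_cache: dict = field(
        default_factory=dict, compare=False, hash=False, repr=False
    )


def field_tokens(entry):
    """Memoize the entry's field tokens on the entry itself."""
    key = ("field_tokens",)
    if key not in entry._tokens_cache:
        entry._tokens_cache[key] = {
            "path": counting_tokenize(entry.path),
            "summary": counting_tokenize(entry.summary),
        }
    return entry._tokens_cache[key]


def score_entry(entry, query):
    """Toy scorer: path hits weigh 2, summary hits weigh 1 (assumed weights)."""
    q = frozenset(query.lower().split())  # query tokens are not counted
    toks = field_tokens(entry)
    return 2 * len(q & toks["path"]) + len(q & toks["summary"])


def test_repeated_scoring_does_not_retokenize():
    entry = Entry(path="docs cache", summary="token cache notes")
    CALLS["n"] = 0
    score_entry(entry, "cache")
    after_first = CALLS["n"]
    for query in ("token", "notes", "cache"):
        score_entry(entry, query)
    assert after_first == 2          # path + summary, tokenized once each
    assert CALLS["n"] == after_first  # later queries hit the cache
```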

306 passed (up from 303).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

silversurfer562 merged commit 1eaa641 into main on May 8, 2026