Add low-latency raw memory search mode#195
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0a4ce247c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cached = self._profile_catalog_cache.get(user_id) | ||
| if cached and now - cached[0] <= PROFILE_CATALOG_CACHE_TTL_SECONDS: | ||
| return cached[1], cached[2] |
There was a problem hiding this comment.
Invalidate profile catalog cache on fresh memory writes
This cache returns previously fetched profile records for up to 60 seconds without any ingest-triggered invalidation, so a user who ingests new profile facts and immediately calls /v1/memory/retrieve or /v1/memory/search can receive stale results that omit the just-written memory. Because _search_profile and raw profile search both reuse these cached records, this directly regresses read-after-write correctness for normal product flows.
Useful? React with 👍 / 👎.
| for entry in profile_catalog | ||
| ) | ||
| ) | ||
| return (user_id, query.strip().lower(), top_k, catalog_key) |
There was a problem hiding this comment.
Expand retrieval-plan cache key to reflect memory changes
The retrieval-plan cache key only includes user_id, normalized query, top_k, and profile catalog topics/subtopics, so repeated queries within TTL will reuse old tool calls even after non-profile memory changes (e.g., new temporal/summary/snippet data). In that case run() skips tool-selection LLM and can continue querying the wrong domains, producing stale or incomplete answers until the cache expires.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request enhances the memory search functionality by introducing a "code" domain and optional LLM-synthesized answers. Key changes include refactoring the RetrievalPipeline to support raw_search and synthesize_answer methods, implementing caching for profile catalogs and retrieval plans, and adding a latency tracking system with Prometheus metrics. Feedback suggests addressing a potential memory leak in the profile cache, improving encapsulation by avoiding protected method calls, and ensuring that search results are truncated to respect top_k limits and LLM context window constraints.
| self._profile_catalog_cache: Dict[ | ||
| str, Tuple[float, List[Dict[str, str]], List[Any]] | ||
| ] = {} |
There was a problem hiding this comment.
The _profile_catalog_cache is a dictionary keyed by user_id that grows indefinitely as new users perform searches. Since RetrievalPipeline is a long-lived singleton, this could lead to a memory leak in a production environment with many users. Consider using an LRU cache or setting a maximum size for this dictionary.
| code_results = await _search_code( | ||
| query=req.query, | ||
| user_id=user_id, | ||
| org_id=req.org_id, | ||
| repo=req.repo or "", | ||
| top_k=req.top_k, | ||
| ) |
There was a problem hiding this comment.
|
|
||
| if req.answer: | ||
| answer_start = time.perf_counter() | ||
| answer = await pipeline.synthesize_answer(req.query, all_results) |
There was a problem hiding this comment.
When req.answer is true, the all_results list is passed to synthesize_answer without any truncation. If the combined results from memory and code domains are large, this could exceed the LLM's context window. Consider truncating all_results to the top k hits before synthesis.
References
- Ensure that context provided to LLMs is within reasonable token limits to avoid truncation or API errors.
| memory_results, raw_latency = await pipeline.raw_search( | ||
| query=req.query, | ||
| user_id=user_id, | ||
| domains=memory_domains, | ||
| top_k=req.top_k, | ||
| ) |
There was a problem hiding this comment.
The raw_search method returns up to top_k results per domain. With multiple domains selected, the final all_results list can contain significantly more than top_k items. If the API consumer expects at most top_k results total, the final list should be truncated after sorting.
References
- Ensure search results respect the requested limit (top_k) across all aggregated domains.
|
Follow-up pushed in e555bea:
Validation:
GitHub currently reports 1 check on the new head commit, with 0 failed and 0 pending. |
|
Followed up on the remaining review note about the memory route calling the protected code retrieval dispatcher directly. Changes in
Validation run locally:
|
|
| Filename | Overview |
|---|---|
| src/api/routes/memory.py | Rewrites search_memory to use raw_search + optional answer synthesis; adds code domain support with graceful degradation; removes old per-domain helper functions; adds cache invalidation on ingest. Dict merge order in _search_code may silently override source_domain. |
| src/pipelines/retrieval.py | Adds raw_search, synthesize_answer, record_latency, and cache invalidation methods; introduces profile catalog LRU cache and retrieval plan LRU cache keyed by user/query/version (top_k excluded). Implementation looks correct. |
| src/pipelines/code_retrieval.py | Adds raw_search that concurrently calls search_symbols and search_files then fuses via RRF. asyncio.gather without return_exceptions=True means a single-tool failure silently discards partial results from the succeeding tool. |
| src/api/schemas.py | Adds answer, confidence, mode, and latency fields to SearchResponse; adds snippet and code to allowed domains; adds org_id, repo, and answer fields to SearchRequest. Formatting-only changes elsewhere. |
| tests/api/test_dependencies_and_routes.py | Adds tests for _search_code delegation, code search failure preserving memory results, and answer synthesis failure preserving raw results. Coverage matches the new graceful-degradation paths. |
| tests/integration/test_retrieval_pipeline.py | Adds integration tests for raw_search (all domains, top-k limiting, profile cache invalidation) and retrieval plan caching (LLM skipped on cache hit, invalidated on user data change). |
| tests/integration/test_code_retrieval_pipeline.py | Adds tests for CodeRetrievalPipeline.raw_search: verifies both tools are called, deduplication via _fuse_source_records, and hard limit enforcement. |
Sequence Diagram
sequenceDiagram
participant Client
participant SearchEndpoint as POST /v1/memory/search
participant RetrievalPipeline as RetrievalPipeline.raw_search
participant CodePipeline as CodeRetrievalPipeline.raw_search
participant Synthesizer as pipeline.synthesize_answer
Client->>SearchEndpoint: "{query, domains, top_k, answer?, org_id?}"
SearchEndpoint->>SearchEndpoint: validate org_id if code in domains
alt memory_domains non-empty
SearchEndpoint->>RetrievalPipeline: raw_search(query, user_id, domains, top_k)
RetrievalPipeline->>RetrievalPipeline: profile (from catalog cache)
par concurrent
RetrievalPipeline->>RetrievalPipeline: temporal search
and
RetrievalPipeline->>RetrievalPipeline: summary search
and
RetrievalPipeline->>RetrievalPipeline: snippet search
end
RetrievalPipeline-->>SearchEndpoint: "(records[], latency{mode,current_ms})"
end
alt code in domains
SearchEndpoint->>CodePipeline: raw_search(query, user_id, repo, top_k)
par concurrent
CodePipeline->>CodePipeline: _execute_tool(search_symbols)
and
CodePipeline->>CodePipeline: _execute_tool(search_files)
end
CodePipeline->>CodePipeline: _fuse_source_records(RRF dedupe)
CodePipeline-->>SearchEndpoint: SourceRecord[]
end
SearchEndpoint->>SearchEndpoint: sort and slice to top_k
alt "req.answer == true"
SearchEndpoint->>Synthesizer: synthesize_answer(query, all_results)
Synthesizer->>Synthesizer: _format_tool_results + LLM
Synthesizer-->>SearchEndpoint: answer string
end
SearchEndpoint-->>Client: "SearchResponse{results, answer, mode, latency}"
Reviews (6): Last reviewed commit: "Preserve raw results on answer synthesis..." | Re-trigger Greptile
8f0591e to
c424928
Compare
|
Rebased this branch onto current Local focused validation on
The GitHub label check is green; Greptile review is still running on the new head. |
|
Follow-up pushed in
Validation:
|
|
Follow-up pushed in
Validation:
|
|
Follow-up pushed in
Validation:
|
|
Follow-up pushed in
Validation:
|
Closes #163.
Summary
/v1/memory/searchonto a low-latency raw retrieval path that skips LLM tool selection and synthesis by defaultanswer=trueto synthesize from already-retrieved raw hits when callers need the slower answer path/retrievepathValidation
\.venv\Scripts\python.exe -m pytest tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py tests/unit/test_schemas.py\.venv\Scripts\python.exe -m ruff check src/pipelines/retrieval.py src/api/schemas.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py\.venv\Scripts\python.exe -m black --target-version py311 --check src/pipelines/retrieval.py src/api/schemas.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.pygit diff --check