Skip to content

Add low-latency raw memory search mode#195

Open
ProtonsAndElectrons wants to merge 7 commits into
XortexAI:mainfrom
ProtonsAndElectrons:bounty-163-raw-search
Open

Add low-latency raw memory search mode#195
ProtonsAndElectrons wants to merge 7 commits into
XortexAI:mainfrom
ProtonsAndElectrons:bounty-163-raw-search

Conversation

@ProtonsAndElectrons
Copy link
Copy Markdown

Closes #163.

Summary

  • move /v1/memory/search onto a low-latency raw retrieval path that skips LLM tool selection and synthesis by default
  • include ranked profile, temporal, summary, snippet, and optional code-domain hits
  • add answer=true to synthesize from already-retrieved raw hits when callers need the slower answer path
  • cache profile catalogs and repeated retrieval plans for the agentic /retrieve path
  • report in-process p50/p95/p99 latency stats per retrieval mode

Validation

  • \.venv\Scripts\python.exe -m pytest tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py tests/unit/test_schemas.py
  • \.venv\Scripts\python.exe -m ruff check src/pipelines/retrieval.py src/api/schemas.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py
  • \.venv\Scripts\python.exe -m black --target-version py311 --check src/pipelines/retrieval.py src/api/schemas.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py
  • git diff --check

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0a4ce247c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +753 to +755
cached = self._profile_catalog_cache.get(user_id)
if cached and now - cached[0] <= PROFILE_CATALOG_CACHE_TTL_SECONDS:
return cached[1], cached[2]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Invalidate profile catalog cache on fresh memory writes

This cache returns previously fetched profile records for up to 60 seconds without any ingest-triggered invalidation, so a user who ingests new profile facts and immediately calls /v1/memory/retrieve or /v1/memory/search can receive stale results that omit the just-written memory. Because _search_profile and raw profile search both reuse these cached records, this directly regresses read-after-write correctness for normal product flows.

Useful? React with 👍 / 👎.

Comment thread src/pipelines/retrieval.py Outdated
for entry in profile_catalog
)
)
return (user_id, query.strip().lower(), top_k, catalog_key)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Expand retrieval-plan cache key to reflect memory changes

The retrieval-plan cache key only includes user_id, normalized query, top_k, and profile catalog topics/subtopics, so repeated queries within TTL will reuse old tool calls even after non-profile memory changes (e.g., new temporal/summary/snippet data). In that case run() skips tool-selection LLM and can continue querying the wrong domains, producing stale or incomplete answers until the cache expires.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances the memory search functionality by introducing a "code" domain and optional LLM-synthesized answers. Key changes include refactoring the RetrievalPipeline to support raw_search and synthesize_answer methods, implementing caching for profile catalogs and retrieval plans, and adding a latency tracking system with Prometheus metrics. Feedback suggests addressing a potential memory leak in the profile cache, improving encapsulation by avoiding protected method calls, and ensuring that search results are truncated to respect top_k limits and LLM context window constraints.

Comment thread src/pipelines/retrieval.py Outdated
Comment on lines +190 to +192
self._profile_catalog_cache: Dict[
str, Tuple[float, List[Dict[str, str]], List[Any]]
] = {}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _profile_catalog_cache is a dictionary keyed by user_id that grows indefinitely as new users perform searches. Since RetrievalPipeline is a long-lived singleton, this could lead to a memory leak in a production environment with many users. Consider using an LRU cache or setting a maximum size for this dictionary.

Comment thread src/api/routes/memory.py Outdated
Comment on lines +995 to +1001
code_results = await _search_code(
query=req.query,
user_id=user_id,
org_id=req.org_id,
repo=req.repo or "",
top_k=req.top_k,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The _search_code function directly calls the protected method _execute_tool on the CodeRetrievalPipeline. It is generally better to expose a public method for raw search in the pipeline class to maintain encapsulation, similar to how RetrievalPipeline has a raw_search method.

Comment thread src/api/routes/memory.py Outdated

if req.answer:
answer_start = time.perf_counter()
answer = await pipeline.synthesize_answer(req.query, all_results)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When req.answer is true, the all_results list is passed to synthesize_answer without any truncation. If the combined results from memory and code domains are large, this could exceed the LLM's context window. Consider truncating all_results to the top k hits before synthesis.

References
  1. Ensure that context provided to LLMs is within reasonable token limits to avoid truncation or API errors.

Comment thread src/api/routes/memory.py
Comment on lines +975 to +980
memory_results, raw_latency = await pipeline.raw_search(
query=req.query,
user_id=user_id,
domains=memory_domains,
top_k=req.top_k,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The raw_search method returns up to top_k results per domain. With multiple domains selected, the final all_results list can contain significantly more than top_k items. If the API consumer expects at most top_k results total, the final list should be truncated after sorting.

References
  1. Ensure search results respect the requested limit (top_k) across all aggregated domains.

@ProtonsAndElectrons
Copy link
Copy Markdown
Author

Follow-up pushed in e555bea:

  • invalidates retrieval/profile caches after successful memory ingest so raw profile reads and tool plans do not stay stale after writes
  • bounds the profile catalog cache with LRU eviction and adds per-user memory versions to retrieval-plan cache keys
  • truncates raw/search results to the requested top_k before optional answer synthesis
  • adds regression coverage for profile cache invalidation and retrieval-plan cache invalidation

Validation:

  • ./.venv/Scripts/python.exe -m pytest tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py tests/unit/test_schemas.py
  • ./.venv/Scripts/python.exe -m ruff check src/pipelines/retrieval.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py src/api/schemas.py
  • ./.venv/Scripts/python.exe -m black --target-version py311 --check src/pipelines/retrieval.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py
  • git diff --check

GitHub currently reports 1 check on the new head commit, with 0 failed and 0 pending.

@ProtonsAndElectrons
Copy link
Copy Markdown
Author

Followed up on the remaining review note about the memory route calling the protected code retrieval dispatcher directly.

Changes in 8f0591e:

  • added CodeRetrievalPipeline.raw_search(...) as the public no-LLM code search entry point
  • updated /v1/memory/search code-domain search to call that public method instead of _execute_tool
  • added focused coverage for the public wrapper and the route helper

Validation run locally:

  • .\.venv\Scripts\python.exe -m pytest tests/integration/test_code_retrieval_pipeline.py tests/integration/test_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py tests/unit/test_schemas.py
  • .\.venv\Scripts\python.exe -m ruff check src/pipelines/code_retrieval.py src/api/routes/memory.py tests/integration/test_code_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py
  • .\.venv\Scripts\python.exe -m black --target-version py311 --check src/api/routes/memory.py tests/integration/test_code_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py
  • git diff --check

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR moves /v1/memory/search onto a low-latency raw retrieval path that skips LLM tool selection and synthesis by default, adds an answer=true toggle for on-demand synthesis from already-retrieved hits, and introduces profile-catalog and retrieval-plan LRU caches to speed up the agentic /retrieve path.

  • Raw search path (raw_search): concurrently fetches profile, temporal, summary, and snippet domains without any LLM call; merges with optional code-domain results and sorts by score before returning.
  • Plan caching: retrieval plan cache key is now (user_id, query, version, catalog_key)top_k deliberately excluded so the same plan is reused across result-size variants; invalidated on every ingest via invalidate_user_cache.
  • Latency tracking: per-request current_ms is returned to callers; aggregate p50/p95/p99 are logged internally only, addressing prior concerns about infrastructure probing.

Confidence Score: 5/5

Safe to merge; both findings are defensive hardening suggestions with no current incorrect behavior.

The core raw-search path, plan caching, cache invalidation on ingest, and graceful degradation (code/answer failures return partial results rather than 500s) all look correct and are well-covered by new tests. The two suggestions — reversing the metadata merge order in _search_code and adding return_exceptions=True to the code pipeline's asyncio.gather — are improvements to robustness but do not represent wrong behavior under any currently exercised code path.

src/pipelines/code_retrieval.py (partial tool failure handling) and src/api/routes/memory.py (metadata merge order in _search_code)

Important Files Changed

Filename Overview
src/api/routes/memory.py Rewrites search_memory to use raw_search + optional answer synthesis; adds code domain support with graceful degradation; removes old per-domain helper functions; adds cache invalidation on ingest. Dict merge order in _search_code may silently override source_domain.
src/pipelines/retrieval.py Adds raw_search, synthesize_answer, record_latency, and cache invalidation methods; introduces profile catalog LRU cache and retrieval plan LRU cache keyed by user/query/version (top_k excluded). Implementation looks correct.
src/pipelines/code_retrieval.py Adds raw_search that concurrently calls search_symbols and search_files then fuses via RRF. asyncio.gather without return_exceptions=True means a single-tool failure silently discards partial results from the succeeding tool.
src/api/schemas.py Adds answer, confidence, mode, and latency fields to SearchResponse; adds snippet and code to allowed domains; adds org_id, repo, and answer fields to SearchRequest. Formatting-only changes elsewhere.
tests/api/test_dependencies_and_routes.py Adds tests for _search_code delegation, code search failure preserving memory results, and answer synthesis failure preserving raw results. Coverage matches the new graceful-degradation paths.
tests/integration/test_retrieval_pipeline.py Adds integration tests for raw_search (all domains, top-k limiting, profile cache invalidation) and retrieval plan caching (LLM skipped on cache hit, invalidated on user data change).
tests/integration/test_code_retrieval_pipeline.py Adds tests for CodeRetrievalPipeline.raw_search: verifies both tools are called, deduplication via _fuse_source_records, and hard limit enforcement.

Sequence Diagram

sequenceDiagram
    participant Client
    participant SearchEndpoint as POST /v1/memory/search
    participant RetrievalPipeline as RetrievalPipeline.raw_search
    participant CodePipeline as CodeRetrievalPipeline.raw_search
    participant Synthesizer as pipeline.synthesize_answer

    Client->>SearchEndpoint: "{query, domains, top_k, answer?, org_id?}"
    SearchEndpoint->>SearchEndpoint: validate org_id if code in domains

    alt memory_domains non-empty
        SearchEndpoint->>RetrievalPipeline: raw_search(query, user_id, domains, top_k)
        RetrievalPipeline->>RetrievalPipeline: profile (from catalog cache)
        par concurrent
            RetrievalPipeline->>RetrievalPipeline: temporal search
        and
            RetrievalPipeline->>RetrievalPipeline: summary search
        and
            RetrievalPipeline->>RetrievalPipeline: snippet search
        end
        RetrievalPipeline-->>SearchEndpoint: "(records[], latency{mode,current_ms})"
    end

    alt code in domains
        SearchEndpoint->>CodePipeline: raw_search(query, user_id, repo, top_k)
        par concurrent
            CodePipeline->>CodePipeline: _execute_tool(search_symbols)
        and
            CodePipeline->>CodePipeline: _execute_tool(search_files)
        end
        CodePipeline->>CodePipeline: _fuse_source_records(RRF dedupe)
        CodePipeline-->>SearchEndpoint: SourceRecord[]
    end

    SearchEndpoint->>SearchEndpoint: sort and slice to top_k

    alt "req.answer == true"
        SearchEndpoint->>Synthesizer: synthesize_answer(query, all_results)
        Synthesizer->>Synthesizer: _format_tool_results + LLM
        Synthesizer-->>SearchEndpoint: answer string
    end

    SearchEndpoint-->>Client: "SearchResponse{results, answer, mode, latency}"
Loading

Fix All in Cursor Fix All in Codex Fix All in Claude Code

Reviews (6): Last reviewed commit: "Preserve raw results on answer synthesis..." | Re-trigger Greptile

Comment thread src/api/routes/memory.py Outdated
Comment thread src/pipelines/retrieval.py
Comment thread src/pipelines/retrieval.py Outdated
Comment thread src/api/schemas.py
@ProtonsAndElectrons
Copy link
Copy Markdown
Author

Rebased this branch onto current main to clear the merge conflict from the recent upstream local-setup changes.

Local focused validation on c4249289 is green:

  • python -m pytest tests/api/test_dependencies_and_routes.py tests/integration/test_retrieval_pipeline.py tests/integration/test_code_retrieval_pipeline.py
  • git diff --check origin/main...HEAD

The GitHub label check is green; Greptile review is still running on the new head.

@ProtonsAndElectrons
Copy link
Copy Markdown
Author

Follow-up pushed in 4b6e838 addressing the latest review notes:

  • validate missing org_id for code-domain searches before running memory-domain raw search
  • switch retrieval-plan cache to LRU behavior and keep its key independent of top_k
  • keep aggregate latency percentiles internal/metrics-only; API latency payload now exposes only the current request timing
  • adjust focused regression coverage for the public latency payload and top-k-independent plan cache reuse

Validation:

  • python -m pytest tests/api/test_dependencies_and_routes.py tests/integration/test_retrieval_pipeline.py tests/integration/test_code_retrieval_pipeline.py
  • python -m ruff check src/pipelines/retrieval.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py
  • python -m black --target-version py311 --check src/pipelines/retrieval.py src/api/routes/memory.py tests/integration/test_retrieval_pipeline.py
  • git diff --check origin/main...HEAD

@ProtonsAndElectrons
Copy link
Copy Markdown
Author

Follow-up pushed in 51ef5e0 based on Greptile's code-domain raw search note:

  • fuse native search_symbols/search_files raw results with reciprocal-rank scoring
  • dedupe repeated source identities while keeping the highest-scored record
  • enforce the requested top_k limit on the final code raw-search result set
  • added regression coverage for duplicate code hits plus final result limiting

Validation:

  • python -m pytest tests/integration/test_code_retrieval_pipeline.py tests/api/test_dependencies_and_routes.py
  • python -m ruff check src/pipelines/code_retrieval.py tests/integration/test_code_retrieval_pipeline.py
  • python -m black --target-version py311 --check tests/integration/test_code_retrieval_pipeline.py
  • git diff --check origin/main...HEAD

Comment thread src/api/routes/memory.py
@ProtonsAndElectrons
Copy link
Copy Markdown
Author

Follow-up pushed in 033b4d6 for Greptile's code-search failure note:

  • catch code-domain raw search failures at the mixed /v1/memory/search route boundary
  • keep already-fetched memory-domain results instead of returning 500 for the whole request
  • still record code-search latency for observability
  • added regression coverage proving a profile + code request returns the profile hit when code search raises

Validation:

  • python -m pytest tests/api/test_dependencies_and_routes.py -q
  • python -m pytest tests/api/test_dependencies_and_routes.py tests/integration/test_retrieval_pipeline.py tests/integration/test_code_retrieval_pipeline.py -q
  • python -m black --target-version py311 --check src/api/routes/memory.py tests/api/test_dependencies_and_routes.py
  • python -m ruff check src/api/routes/memory.py tests/api/test_dependencies_and_routes.py
  • git diff --check origin/main...HEAD

Comment thread src/api/routes/memory.py
@ProtonsAndElectrons
Copy link
Copy Markdown
Author

Follow-up pushed in e718d5b for Greptile's answer-synthesis failure note:

  • catch pipeline.synthesize_answer() failures inside /v1/memory/search
  • keep the successfully retrieved raw results instead of returning 500
  • fall back to answer="", confidence=0.0, and mode="raw"
  • still records answer latency for observability
  • added regression coverage for answer=True preserving raw results when synthesis fails

Validation:

  • python -m pytest tests/api/test_dependencies_and_routes.py -q
  • python -m pytest tests/api/test_dependencies_and_routes.py tests/integration/test_retrieval_pipeline.py tests/integration/test_code_retrieval_pipeline.py -q
  • python -m black --target-version py311 --check src/api/routes/memory.py tests/api/test_dependencies_and_routes.py
  • python -m ruff check src/api/routes/memory.py tests/api/test_dependencies_and_routes.py
  • git diff --check origin/main...HEAD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add low-latency raw search path separate from agentic answer synthesis

1 participant