# Search Design Document

## Task

Given a query $q$, the system aims to retrieve the most relevant concept or document $d$ from a candidate set $\mathcal{D}$. Here, each $d$ is an individual ontology concept, and $\mathcal{D}$ is the set of candidate concepts drawn from the ontologies we index. In our context, this means identifying the ontological concepts relevant to a given entity (e.g., in Named Entity Recognition scenarios) or to a broader text input, enabling semantic grounding and structured representation of the extracted information.

## Current Situation

Currently, we use **BioPortal** as our ontology database. BioPortal is a well-established, community-trusted platform that hosts and manages a large number of ontologies. However, these benefits come with several trade-offs:

1. **Dependency on BioPortal** — If BioPortal is unavailable (for example, during upgrades), our use case is directly impacted.
2. **API rate limits** — Rate limiting can slow down API calls. While this is understandable given BioPortal’s design and shared usage model, it affects performance.
3. **Implementation dependency** — We rely on BioPortal’s implementations (e.g., search), which may not always be optimal or fully aligned with our specific use case.

## Overview & Requirements

Before going into further detail, let’s first understand the steps involved. This task typically consists of two main stages: **retrieval** and **reranking**.

In the first stage, **retrieval**, the objective is to identify a subset of potentially relevant candidates from $\mathcal{D}$. This is achieved by maximizing a scoring function $f(q, d)$, which estimates the relevance between the query and each candidate document:

$$
d^* = \arg\max_{d \in \mathcal{D}} f(q, d)
$$

In practice, retrieval keeps not just the single maximizer $d^*$ but the $K$ highest-scoring candidates under $f$, which form the subset $\mathcal{D}_K \subseteq \mathcal{D}$ passed to the second stage. Note that at this stage we want to prioritize high recall and computational efficiency.

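As an illustration of the retrieval stage, the sketch below uses a simple token-overlap score as a stand-in for $f(q, d)$ and keeps the top-$K$ candidates to form $\mathcal{D}_K$. The scoring function, corpus, and function names are toy placeholders, not the production implementation (which would use BM25 or an ANN index).

```python
import heapq


def f(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens found in the document.
    A cheap stand-in for a real first-stage scorer such as BM25."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)


def retrieve_top_k(query: str, docs: list[str], k: int) -> list[str]:
    """First stage: form the candidate set D_K by keeping the K
    highest-scoring documents under f (high recall, cheap to compute)."""
    return heapq.nlargest(k, docs, key=lambda d: f(query, d))


corpus = [
    "neuron cell type",
    "cortical neuron",
    "astrocyte glial cell",
    "gene expression profile",
]
d_k = retrieve_top_k("cortical neuron", corpus, k=2)  # D_K: best lexical matches
```

`heapq.nlargest` returns candidates in descending score order, so the same call can also serve as a cheap first-pass ranking.
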
In the second stage, **reranking**, the retrieved candidates $\mathcal{D}_K$ are re-evaluated using a more expressive (and often computationally expensive) relevance model $g(q, d)$. The goal is to refine the initial ordering by more precisely estimating relevance:

$$
d^{**} = \arg\max_{d \in \mathcal{D}_K} g(q, d)
$$

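To make the second stage concrete, the sketch below re-orders a candidate set $\mathcal{D}_K$ with a more expensive scorer. Here `difflib.SequenceMatcher` stands in for the expressive model $g(q, d)$ (in practice a cross-encoder or LLM reranker); names and data are illustrative only.

```python
from difflib import SequenceMatcher


def g(query: str, doc: str) -> float:
    """Stand-in for an expensive relevance model (e.g., a cross-encoder):
    character-level similarity is costlier than token overlap but finer-grained."""
    return SequenceMatcher(None, query.lower(), doc.lower()).ratio()


def rerank(query: str, d_k: list[str]) -> list[str]:
    """Second stage: re-order the candidate set D_K by g; the first
    element of the result is the final answer d**."""
    return sorted(d_k, key=lambda d: g(query, d), reverse=True)


d_k = ["neuron cell type", "cortical neuron", "neuronal cell body"]
ranked = rerank("cortical neuron", d_k)
```

Because $g$ only runs over the small set $\mathcal{D}_K$, its higher cost stays bounded regardless of the full corpus size.
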
With this in mind, we define the following search requirements (focused on algorithms, not system design):

1. **Contextualized retrieval** — The implementation must overcome the limitations of sparse retrieval methods such as BM25 and basic similarity scoring, which lack contextual understanding. This includes support for contextualized approaches such as cross-encoders, dual-encoders, and late-interaction models (e.g., ColBERT).
2. **Keyword-based retrieval** — The system must also support fast and efficient keyword-based search.
3. **Generalizability** — The implementation should be easily adaptable to other use cases with minimal or no additional effort.

## Proposed Approach

The figure below presents a high-level overview of the proposed approach. The BM25/inverted-index path corresponds to the keyword-based retrieval requirement, while the vector-search path together with the dual-encoder, late-interaction, and cross-encoder rerankers addresses the contextualized retrieval requirement. Note that not all techniques shown will be used simultaneously; the final selection depends on trade-offs such as accuracy versus computational cost. For example, cross-encoding techniques offer high accuracy but are computationally expensive. Dual-encoder (or bi-encoder) techniques, on the other hand, provide a better balance between accuracy and computational efficiency.

```mermaid
flowchart TB
    subgraph R1["Retrieval"]
        B{"Candidate Retrieval"}
        C1["BM25 / Inverted Index"]
        C2["Vector Search / ANN Index"]
        D["Candidate Set (D_K)"]
    end
    subgraph R2["Re-ranking"]
        G{"Scoring / Re-rank"}
        N3["Dual Encoder"]
        H1["Late-Interaction Encoder"]
        H2["Cross-Encoder / LLM Reranker"]
    end
    A["Input Text"] --> B
    B -- "Keyword (BM25)" --> C1
    B -- "Dense Embeddings" --> C2
    C1 --> D
    C2 --> D
    D --> G
    G --> N3
    G --> H1
    G --> H2
    N3 --> I["Final Results (d**)"]
    H1 --> I
    H2 --> I
```

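The late-interaction path in the diagram (ColBERT-style) scores a query against a document by taking, for each query token embedding, its maximum similarity to any document token embedding, and summing the results. The sketch below shows this MaxSim operation on toy pre-computed 2-d vectors; the embeddings are illustrative, not real model outputs.

```python
from math import sqrt


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """ColBERT-style late interaction: for each query token embedding,
    keep its best match among document token embeddings, then sum."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Toy per-token "embeddings" (illustrative only).
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.1], [0.1, 1.0]]    # both query tokens find a close match
doc_b = [[-1.0, 0.0], [0.0, -1.0]]  # tokens point the opposite way

score_a = maxsim(query_vecs, doc_a)
score_b = maxsim(query_vecs, doc_b)
```

Because document token embeddings can be pre-computed offline, late interaction keeps most of a cross-encoder's token-level precision at a fraction of its query-time cost.
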
## Implementation

The system will be implemented as an API-first service. Agents and external clients will consume the same API endpoints, ensuring a unified interface, consistent behavior, and no duplicate implementations across the tool and service layers.

The API will encapsulate the full retrieval and ranking pipeline, including candidate retrieval and reranking. The architecture will remain modular to allow interchangeable ranking components while preserving a stable external interface.

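As a framework-agnostic sketch of this API-first design, a single handler could expose the whole pipeline behind one endpoint. The function names, parameters, response fields, and toy pipeline stages below are all illustrative assumptions, not a fixed contract.

```python
# Toy stand-ins for the two pipeline stages (illustrative only).
CORPUS = ["cortical neuron", "astrocyte glial cell", "gene expression profile"]


def retrieve(query: str, k: int) -> list[str]:
    """Stage 1: cheap keyword-overlap retrieval building D_K."""
    overlap = lambda d: len(set(query.split()) & set(d.split()))
    return sorted(CORPUS, key=overlap, reverse=True)[:k]


def rerank_candidates(query: str, candidates: list[str]) -> list[str]:
    """Stage 2: re-order D_K (a real system would call a neural reranker)."""
    overlap = lambda d: len(set(query.split()) & set(d.split()))
    return sorted(candidates, key=overlap, reverse=True)


def search_handler(query: str, k: int = 10, use_rerank: bool = True) -> dict:
    """Single entry point consumed by agents and external clients alike;
    encapsulates retrieval and optional reranking behind one interface."""
    candidates = retrieve(query, k)
    results = rerank_candidates(query, candidates) if use_rerank else candidates
    return {
        "query": query,
        "results": results,
        "metadata": {"retriever": "keyword-overlap", "reranked": use_rerank},
    }


response = search_handler("cortical neuron", k=2)
```

Keeping ranking components behind this one handler is what lets them be swapped (BM25, dual encoder, cross-encoder) without changing the external contract.
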
### Ranking and Retrieval Strategy

The implementation will prioritize a hybrid retrieval and reranking approach, combining fast lexical retrieval with dense and late-interaction models for improved accuracy. The following models and methods will be evaluated and integrated where appropriate:

- Li, Y., Li, J., Yu, M., Ding, G., Lin, Z., Wang, W. and Zhou, J., 2026. Query-focused and Memory-aware Reranker for Long Context Processing. arXiv preprint arXiv:2602.12192.
- Lù, X.H., 2024. BM25S: Orders of magnitude faster lexical search via eager sparse scoring. arXiv preprint arXiv:2407.03618.
- Jha, R., Wang, B., Günther, M., Mastrapas, G., Sturua, S., Mohr, I., Koukounas, A., Wang, M.K., Wang, N. and Xiao, H., 2024, November. Jina-ColBERT-v2: A general-purpose multilingual late interaction retriever. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024) (pp. 159-166).
- Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J. and Huang, F., 2025. Qwen3 Embedding: Advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176.
- BiomedBERT Reranker - [https://huggingface.co/NeuML/biomedbert-base-reranker](https://huggingface.co/NeuML/biomedbert-base-reranker)

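One common way to combine a lexical ranking (e.g., BM25/BM25S) with a dense-embedding ranking in a hybrid setup is reciprocal rank fusion (RRF). The sketch below is a minimal version with illustrative inputs; the constant `k = 60` is the value commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each list contributes 1 / (k + rank)
    for every document it contains; higher fused score ranks first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


lexical = ["doc_a", "doc_b", "doc_c"]  # e.g., BM25 order
dense = ["doc_b", "doc_d", "doc_a"]    # e.g., embedding order
fused = reciprocal_rank_fusion([lexical, dense])
```

Note how `doc_b` (ranked 2nd and 1st) overtakes `doc_a` (ranked 1st and 3rd): RRF rewards consistent agreement between retrievers without needing their raw scores to be comparable.
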
### Design Principles

- API-first architecture with a single service interface
- Modular retrieval and ranking components
- Support for sparse, dense, and late-interaction methods
- Configurable reranking layer (neural or LLM-based)
- Scalability for large candidate concept sets
- Extensibility for domain-specific models
- **Provenance by design**, ensuring that model versions, configurations, scoring methods, and retrieval metadata are tracked and reproducible

This design enables flexible experimentation with retrieval and reranking strategies while maintaining a stable and production-ready service interface.

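The provenance-by-design principle could be realized by attaching a record like the one below to every search response; all field names and values here are illustrative assumptions, not a fixed schema.

```python
from datetime import datetime, timezone


def provenance_record(model: str, model_version: str, scoring: str,
                      config: dict) -> dict:
    """Minimal provenance payload: just enough to reproduce a result later.
    Field names are illustrative, not a committed schema."""
    return {
        "model": model,
        "model_version": model_version,   # hypothetical version tag
        "scoring_method": scoring,
        "config": config,                 # retrieval/rerank parameters used
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


record = provenance_record(
    model="dual-encoder",
    model_version="v1.0.0",
    scoring="cosine-similarity",
    config={"top_k": 50, "rerank": True},
)
```
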