feat(parser): enhance hybrid model support with vector names#35
Conversation
- Added support for specifying dense and sparse vector names in hybrid collections during INSERT and CREATE statements. - Updated the parser to handle new syntax for DENSE VECTOR and SPARSE VECTOR clauses. - Modified the InsertStmt and CreateCollectionStmt to include dense_vector and sparse_vector attributes. - Adjusted related tests to verify the correct handling of vector names in hybrid scenarios. test(tests): update tests for new hybrid vector functionality - Enhanced tests to cover new vector name features in INSERT and CREATE statements. - Updated assertions to check for correct vector names in hybrid configurations. - Refactored existing tests to ensure compatibility with the new parser changes.
|
Is this complete or still needs work ? |
no its still need some checks i think. i will let you know when i am done with the experiment |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (5)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughAdds explicit named dense/sparse vector support across parser, AST, executor, and dumper; introduces CollectionTopology for topology-aware routing/validation; updates config, CLI, documentation, and tests to surface and exercise named-vector behavior. ChangesNamed Vector Support for QQL Collections
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/test_executor.py (1)
1903-1917: ⚡ Quick winAdd an explicit “unknown dense vector name” negative test.
Great coverage for ambiguity + explicit selection. Add one failure-path assertion where
dense_vectoris set but not present in collection topology, so fallback behavior can’t regress silently.Suggested test addition
+ def test_dense_search_explicit_vector_unknown_name_raises( + self, executor, mock_client + ): + from qdrant_client.models import Distance, VectorParams + + mock_client.collection_exists.return_value = True + mock_client.get_collection.return_value.config.params.vectors = { + "title": VectorParams(size=384, distance=Distance.COSINE), + "body": VectorParams(size=384, distance=Distance.COSINE), + } + mock_client.get_collection.return_value.config.params.sparse_vectors = None + + node = SearchStmt( + collection="col", query_text="q", limit=5, model=None, dense_vector="missing" + ) + with pytest.raises(QQLRuntimeError, match="dense vector"): + executor.execute(node)Also applies to: 1918-1938
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_executor.py` around lines 1903 - 1917, Add a negative test that sets SearchStmt(collection="col", query_text="q", limit=5, model=None, dense_vector="missing_name") and asserts executor.execute(node) raises QQLRuntimeError with a message about an unknown/invalid dense vector name; use the existing mock_client.get_collection.return_value.config.params.vectors (which defines "title" and "body") so the provided dense_vector "missing_name" is not present and the executor path in SearchStmt (and any validation logic) fails instead of silently falling back. Ensure the new test mirrors naming and setup style of test_dense_search_requires_explicit_name_for_ambiguous_collection and references SearchStmt, executor.execute, and mock_client.get_collection to locate the behavior under test.src/qql/executor.py (2)
403-408: 💤 Low valueRedundant embedding of first text.
first_densere-embedsnode.values_list[0]["text"], but this was already embedded in the first loop iteration at line 390. Consider capturing the first vector from the loop to avoid the duplicate embedding call.♻️ Suggested optimization
+ first_dense_vector: list[float] | None = None points: list[PointStruct] = [] for vals in node.values_list: point_id, payload = self._extract_point_id_and_payload(vals) dense_vector = dense_embedder.embed(vals["text"]) + if first_dense_vector is None: + first_dense_vector = dense_vector sparse_obj = sparse_embedder.embed(vals["text"]) ... if not topology.exists: - first_dense = dense_embedder.embed(node.values_list[0]["text"]) self._create_collection_and_wait( collection_name=node.collection, vectors_config={ - dense_name: VectorParams(size=len(first_dense), distance=Distance.COSINE) + dense_name: VectorParams(size=len(first_dense_vector), distance=Distance.COSINE) },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/qql/executor.py` around lines 403 - 408, The code redundantly calls dense_embedder.embed for node.values_list[0]["text"] when creating the collection; instead, capture the first embedding produced in the loop that processes node.values_list (the same loop that produces embeddings earlier) and reuse that vector for VectorParams in _create_collection_and_wait rather than re-embedding; update the logic around first_dense/dense_name and the loop that generates embeddings (refer to dense_embedder.embed, first_dense, _create_collection_and_wait, and node.values_list) so the first produced vector is stored and passed into the collection creation step when topology.exists is false.
557-560: 💤 Low valueUnused variable; redundant
_resolve_topologycall.
topologyis assigned at line 557 but never used. Meanwhile,_build_vectors_config_diffat line 1336 calls_resolve_topologyagain for the same collection, causing duplicate Qdrant API calls. Consider either removing the unused variable or passingtopologyto_build_vectors_config_diffto reuse it.♻️ Option: Remove unused variable
def _execute_alter_collection(self, node: AlterCollectionStmt) -> ExecutionResult: if not self._client.collection_exists(node.collection): raise QQLRuntimeError(f"Collection '{node.collection}' does not exist") - topology = self._resolve_topology(node.collection) update_kwargs: dict[str, Any] = {"collection_name": node.collection}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/qql/executor.py` around lines 557 - 560, The local variable topology is assigned but unused and causes a duplicate _resolve_topology call inside _build_vectors_config_diff; modify _build_vectors_config_diff to accept a pre-resolved topology (add a parameter like resolved_topology) and use that instead of calling _resolve_topology internally, then update this call site to pass the resolved topology (remove the unused local assignment if you choose to resolve inline) or, alternatively, remove the initial topology assignment and let _build_vectors_config_diff remain responsible for resolving; ensure all other callers of _build_vectors_config_diff are updated to the new signature if you opt to pass topology so you avoid the extra Qdrant API call and eliminate the unused variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/scripts.md`:
- Around line 158-159: Update the example block in docs/scripts.md so its CREATE
and INSERT examples follow the canonical dump syntax: replace occurrences like
"CREATE COLLECTION medical_records HYBRID" with "CREATE COLLECTION <name> USING
HYBRID ..." and change INSERT examples from "] USING HYBRID" style to "INSERT
BULK ... USING HYBRID ..." (and similarly use "USING VECTOR '<vector_name>'" for
named-vector dense collections or plain "CREATE COLLECTION <name>" for unnamed
external collections). Locate the example text mentioning "medical_records" and
the INSERT block and make the CREATE/INSERT lines use the documented "USING
HYBRID" / "USING VECTOR '<name>'" form consistently.
---
Nitpick comments:
In `@src/qql/executor.py`:
- Around line 403-408: The code redundantly calls dense_embedder.embed for
node.values_list[0]["text"] when creating the collection; instead, capture the
first embedding produced in the loop that processes node.values_list (the same
loop that produces embeddings earlier) and reuse that vector for VectorParams in
_create_collection_and_wait rather than re-embedding; update the logic around
first_dense/dense_name and the loop that generates embeddings (refer to
dense_embedder.embed, first_dense, _create_collection_and_wait, and
node.values_list) so the first produced vector is stored and passed into the
collection creation step when topology.exists is false.
- Around line 557-560: The local variable topology is assigned but unused and
causes a duplicate _resolve_topology call inside _build_vectors_config_diff;
modify _build_vectors_config_diff to accept a pre-resolved topology (add a
parameter like resolved_topology) and use that instead of calling
_resolve_topology internally, then update this call site to pass the resolved
topology (remove the unused local assignment if you choose to resolve inline)
or, alternatively, remove the initial topology assignment and let
_build_vectors_config_diff remain responsible for resolving; ensure all other
callers of _build_vectors_config_diff are updated to the new signature if you
opt to pass topology so you avoid the extra Qdrant API call and eliminate the
unused variable.
In `@tests/test_executor.py`:
- Around line 1903-1917: Add a negative test that sets
SearchStmt(collection="col", query_text="q", limit=5, model=None,
dense_vector="missing_name") and asserts executor.execute(node) raises
QQLRuntimeError with a message about an unknown/invalid dense vector name; use
the existing mock_client.get_collection.return_value.config.params.vectors
(which defines "title" and "body") so the provided dense_vector "missing_name"
is not present and the executor path in SearchStmt (and any validation logic)
fails instead of silently falling back. Ensure the new test mirrors naming and
setup style of test_dense_search_requires_explicit_name_for_ambiguous_collection
and references SearchStmt, executor.execute, and mock_client.get_collection to
locate the behavior under test.
🪄 Autofix (Beta)
❌ Autofix failed (check again to retry)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: e8cbe910-5d8b-4347-935b-619980ee515d
📒 Files selected for processing (15)
docs/collections.mddocs/insert.mddocs/programmatic.mddocs/reference.mddocs/scripts.mddocs/search.mdsrc/qql/ast_nodes.pysrc/qql/cli.pysrc/qql/config.pysrc/qql/dumper.pysrc/qql/executor.pysrc/qql/parser.pytests/test_dumper.pytests/test_executor.pytests/test_parser.py
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/qql/cli.py (1)
113-113: 💤 Low valueClarify optional parameter notation in UPDATE help text.
The notation
['<dense_vector>'](square brackets around a quoted parameter) is ambiguous—it could be misread as requiring literal brackets. Consider either removing the quotes inside the brackets to show[<dense_vector>], or documenting both forms explicitly as done in docs/collections.md lines 392-394.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/qql/cli.py` at line 113, The UPDATE help text currently uses the ambiguous "['<dense_vector>']" notation; update the usage string in src/qql/cli.py (the UPDATE help/usage declaration where "['<dense_vector>']" appears) to use unquoted optional parameter notation like "[<dense_vector>]" instead of including literal quotes/brackets, or alternatively list both accepted forms (e.g., "<dense_vector>" and "['<dense_vector>']") in the same help string to match docs/collections.md lines 392-394; ensure the changed string preserves the surrounding format tokens "[yellow]UPDATE[/yellow] <name> [yellow]SET VECTOR[/yellow] ..." and any tests or callers that parse this help text are updated accordingly.src/qql/executor.py (1)
557-560: 💤 Low valueUnused topology variable causes redundant
get_collectioncall.The
topologyresolved at line 557 is never used in this method. Meanwhile,_build_vectors_config_diffat line 1336 resolves topology again internally, resulting in twoget_collectioncalls per ALTER.Consider either removing the unused assignment or passing the topology to
_build_vectors_config_diffto avoid the redundant network call.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/qql/executor.py` around lines 557 - 560, The local variable topology assigned by calling _resolve_topology(node.collection) in the ALTER flow is unused and causes a redundant get_collection call because _build_vectors_config_diff(node.collection, node.config) resolves topology again; either remove the unused topology assignment or modify _build_vectors_config_diff to accept a pre-resolved topology (e.g., add a parameter and use it instead of calling _resolve_topology/get_collection internally) and update the call site in executor.py to pass the resolved topology so only one get_collection is performed per ALTER.tests/test_executor.py (1)
2924-2936: 💤 Low valueMock for unnamed collection could be more realistic.
Using
[None]works because_resolve_topologytreats any non-dict, non-None value as an unnamed vector collection. However, Qdrant actually returns aVectorParamsobject for unnamed collections.A more representative mock would be a MagicMock simulating VectorParams:
from unittest.mock import MagicMock info.config.params.vectors = MagicMock() # or VectorParams(size=3, distance=Distance.COSINE)This would make the test more robust if
_resolve_topologyever adds stricter type checking.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_executor.py` around lines 2924 - 2936, The test test_update_vector_unnamed_collection_sends_plain_list should mock the collection vector params more realistically: instead of setting info.config.params.vectors = [None], set info.config.params.vectors to a MagicMock (or a VectorParams-like object) so it resembles Qdrant's actual VectorParams; update the test that constructs UpdateVectorStmt and calls executor.execute to use this MagicMock return (via mock_client.get_collection.return_value) so _resolve_topology sees a VectorParams-like object for unnamed collections and the assertions on mock_client.update_vectors (points[0].vector) remain valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/qql/cli.py`:
- Line 113: The UPDATE help text currently uses the ambiguous
"['<dense_vector>']" notation; update the usage string in src/qql/cli.py (the
UPDATE help/usage declaration where "['<dense_vector>']" appears) to use
unquoted optional parameter notation like "[<dense_vector>]" instead of
including literal quotes/brackets, or alternatively list both accepted forms
(e.g., "<dense_vector>" and "['<dense_vector>']") in the same help string to
match docs/collections.md lines 392-394; ensure the changed string preserves the
surrounding format tokens "[yellow]UPDATE[/yellow] <name> [yellow]SET
VECTOR[/yellow] ..." and any tests or callers that parse this help text are
updated accordingly.
In `@src/qql/executor.py`:
- Around line 557-560: The local variable topology assigned by calling
_resolve_topology(node.collection) in the ALTER flow is unused and causes a
redundant get_collection call because
_build_vectors_config_diff(node.collection, node.config) resolves topology
again; either remove the unused topology assignment or modify
_build_vectors_config_diff to accept a pre-resolved topology (e.g., add a
parameter and use it instead of calling _resolve_topology/get_collection
internally) and update the call site in executor.py to pass the resolved
topology so only one get_collection is performed per ALTER.
In `@tests/test_executor.py`:
- Around line 2924-2936: The test
test_update_vector_unnamed_collection_sends_plain_list should mock the
collection vector params more realistically: instead of setting
info.config.params.vectors = [None], set info.config.params.vectors to a
MagicMock (or a VectorParams-like object) so it resembles Qdrant's actual
VectorParams; update the test that constructs UpdateVectorStmt and calls
executor.execute to use this MagicMock return (via
mock_client.get_collection.return_value) so _resolve_topology sees a
VectorParams-like object for unnamed collections and the assertions on
mock_client.update_vectors (points[0].vector) remain valid.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 7e9cdf9c-c405-4f3c-9578-c54c548388bc
📒 Files selected for processing (15)
docs/collections.mddocs/insert.mddocs/programmatic.mddocs/reference.mddocs/scripts.mddocs/search.mdsrc/qql/ast_nodes.pysrc/qql/cli.pysrc/qql/config.pysrc/qql/dumper.pysrc/qql/executor.pysrc/qql/parser.pytests/test_dumper.pytests/test_executor.pytests/test_parser.py
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. An unexpected error occurred while generating fixes: Not Found - https://docs.github.com/rest/git/refs#get-a-reference |
…licit vector names
|
@coderabbitai check |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chainI'll review the current state of the PR changes in detail. 🏁 Script executed: Length of output: 70182 🏁 Script executed: Length of output: 34980 🏁 Script executed: Length of output: 20285 🏁 Script executed: Length of output: 6758 🏁 Script executed: Length of output: 14738
Overall AssessmentThe architecture is solid: the new Below are the issues I found, ranked by severity. 🔴 Bugs / Correctness Issues1.
|
… naming requirements
|
@coderabbitai check now once again.. go through the previous comments and fixes we have applied.. check for any gaps still remaining.. |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chainLet me check the current state of the code against all previously raised issues. 🏁 Script executed: Length of output: 15341 🏁 Script executed: Length of output: 7345 🏁 Script executed: Length of output: 183 🏁 Script executed: Length of output: 10303 🏁 Script executed: Length of output: 8323
✅ All Red and Yellow Issues — Resolved
🟠 One Remaining Gap — Executor test for explicit
|
test(tests): update tests for new hybrid vector functionality
Summary by CodeRabbit
New Features
Documentation
Tests