Skip to content

feat(parser): enhance hybrid model support with vector names#35

Merged
pavanjava merged 3 commits into
pavanjava:mainfrom
srimon12:feat/vector-generalization
May 19, 2026
Merged

feat(parser): enhance hybrid model support with vector names#35
pavanjava merged 3 commits into
pavanjava:mainfrom
srimon12:feat/vector-generalization

Conversation

@srimon12
Copy link
Copy Markdown
Collaborator

@srimon12 srimon12 commented May 18, 2026

  • Added support for specifying dense and sparse vector names in hybrid collections during INSERT and CREATE statements.
  • Updated the parser to handle new syntax for DENSE VECTOR and SPARSE VECTOR clauses.
  • Modified the InsertStmt and CreateCollectionStmt to include dense_vector and sparse_vector attributes.
  • Adjusted related tests to verify the correct handling of vector names in hybrid scenarios.

test(tests): update tests for new hybrid vector functionality

  • Enhanced tests to cover new vector name features in INSERT and CREATE statements.
  • Updated assertions to check for correct vector names in hybrid configurations.
  • Refactored existing tests to ensure compatibility with the new parser changes.

Summary by CodeRabbit

  • New Features

    • Support explicit naming of dense/sparse vectors across CREATE/INSERT/SEARCH/UPDATE (USING VECTOR / richer USING HYBRID); UPDATE can target a specific dense vector; ALTER behavior clarified for named/unnamed vectors. Defaults for dense/sparse vector names added.
  • Documentation

    • Expanded examples, CLI help, programmatic docs, and dump output to show named-vector and hybrid syntax and selection rules.
  • Tests

    • Added/updated parser, executor, and dumper tests to cover named-vector parsing, inserts, searches, dumps, and validation.

Review Change Stack

- Added support for specifying dense and sparse vector names in hybrid collections during INSERT and CREATE statements.
- Updated the parser to handle new syntax for DENSE VECTOR and SPARSE VECTOR clauses.
- Modified the InsertStmt and CreateCollectionStmt to include dense_vector and sparse_vector attributes.
- Adjusted related tests to verify the correct handling of vector names in hybrid scenarios.

test(tests): update tests for new hybrid vector functionality

- Enhanced tests to cover new vector name features in INSERT and CREATE statements.
- Updated assertions to check for correct vector names in hybrid configurations.
- Refactored existing tests to ensure compatibility with the new parser changes.
@pavanjava
Copy link
Copy Markdown
Owner

Is this complete or still needs work ?

@srimon12
Copy link
Copy Markdown
Collaborator Author

Is this complete or still needs work ?

no its still need some checks i think. i will let you know when i am done with the experiment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ff6a272f-497c-4da6-bedd-7405992643c8

📥 Commits

Reviewing files that changed from the base of the PR and between 9cf7b3b and 65994fd.

📒 Files selected for processing (5)
  • docs/collections.md
  • src/qql/__init__.py
  • src/qql/executor.py
  • tests/test_executor.py
  • tests/test_parser.py
✅ Files skipped from review due to trivial changes (1)
  • docs/collections.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/qql/init.py
  • tests/test_parser.py
  • tests/test_executor.py
  • src/qql/executor.py

📝 Walkthrough

Walkthrough

Adds explicit named dense/sparse vector support across parser, AST, executor, and dumper; introduces CollectionTopology for topology-aware routing/validation; updates config, CLI, documentation, and tests to surface and exercise named-vector behavior.

Changes

Named Vector Support for QQL Collections

Layer / File(s) Summary
Documentation and examples
docs/collections.md, docs/insert.md, docs/programmatic.md, docs/reference.md, docs/scripts.md, docs/search.md
Docs updated with USING VECTOR '<name>' and expanded USING HYBRID DENSE VECTOR '<dense>' SPARSE VECTOR '<sparse>' syntax, examples, dump/script examples, and ALTER/UPDATE notes about named-vector behavior.
CLI help text
src/qql/cli.py
HELP_TEXT entries extended to document the new USING VECTOR and expanded USING HYBRID/UPDATE SET VECTOR '<name>' grammar.
Configuration defaults
src/qql/config.py
Adds DEFAULT_DENSE_VECTOR_NAME / DEFAULT_SPARSE_VECTOR_NAME and extends QQLConfig with default_dense_vector_name and default_sparse_vector_name; load_config reads persisted defaults.
AST, parser, and parser tests
src/qql/ast_nodes.py, src/qql/parser.py, tests/test_parser.py
AST dataclasses gain dense_vector/sparse_vector/vector_name fields; parser recognizes USING VECTOR and more detailed USING HYBRID sub-clauses (MODEL vs VECTOR) and populates new fields; parser tests added/updated to cover these forms.
Dumper: topology-aware serialization
src/qql/dumper.py, tests/test_dumper.py
Dumper derives collection vector topology and emits explicit USING VECTOR / USING HYBRID DENSE VECTOR ... SPARSE VECTOR ... clauses; tests updated to assert full explicit clauses.
Executor: CollectionTopology model
src/qql/executor.py
Introduces frozen CollectionTopology with properties/helpers (has_dense/has_sparse/is_hybrid, dense_using/dense_payload_name/dense_config_key, sparse_using) and _resolve_topology plus default-vector-name getters.
Executor operations with topology routing
src/qql/executor.py, tests/test_executor.py
Refactors INSERT (single/bulk), CREATE/ALTER COLLECTION, SEARCH (flat/grouped/hybrid/sparse/dense), UPDATE VECTOR, and helpers to use topology-driven vector naming and payload shaping; implements topology-aware dimension validation and config diff construction; tests refactored with shared mocks and autouse fixtures to validate named-vector propagation and ambiguity handling.

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 A topology of vectors, now plain to see,
Dense, sparse, hybrid—name them all with glee!
From parser to executor, each path aligned,
Explicit vector choices, beautifully refined.
No more ambiguity in the search design!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main change: parser enhancements to support named vector syntax (dense_vector, sparse_vector) in hybrid model statements. It reflects the primary PR objective.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@srimon12 srimon12 marked this pull request as ready for review May 19, 2026 13:24
@srimon12 srimon12 requested a review from pavanjava May 19, 2026 13:25
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/test_executor.py (1)

1903-1917: ⚡ Quick win

Add an explicit “unknown dense vector name” negative test.

Great coverage for ambiguity + explicit selection. Add one failure-path assertion where dense_vector is set but not present in collection topology, so fallback behavior can’t regress silently.

Suggested test addition
+    def test_dense_search_explicit_vector_unknown_name_raises(
+        self, executor, mock_client
+    ):
+        from qdrant_client.models import Distance, VectorParams
+
+        mock_client.collection_exists.return_value = True
+        mock_client.get_collection.return_value.config.params.vectors = {
+            "title": VectorParams(size=384, distance=Distance.COSINE),
+            "body": VectorParams(size=384, distance=Distance.COSINE),
+        }
+        mock_client.get_collection.return_value.config.params.sparse_vectors = None
+
+        node = SearchStmt(
+            collection="col", query_text="q", limit=5, model=None, dense_vector="missing"
+        )
+        with pytest.raises(QQLRuntimeError, match="dense vector"):
+            executor.execute(node)

Also applies to: 1918-1938

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_executor.py` around lines 1903 - 1917, Add a negative test that
sets SearchStmt(collection="col", query_text="q", limit=5, model=None,
dense_vector="missing_name") and asserts executor.execute(node) raises
QQLRuntimeError with a message about an unknown/invalid dense vector name; use
the existing mock_client.get_collection.return_value.config.params.vectors
(which defines "title" and "body") so the provided dense_vector "missing_name"
is not present and the executor path in SearchStmt (and any validation logic)
fails instead of silently falling back. Ensure the new test mirrors naming and
setup style of test_dense_search_requires_explicit_name_for_ambiguous_collection
and references SearchStmt, executor.execute, and mock_client.get_collection to
locate the behavior under test.
src/qql/executor.py (2)

403-408: 💤 Low value

Redundant embedding of first text.

first_dense re-embeds node.values_list[0]["text"], but this was already embedded in the first loop iteration at line 390. Consider capturing the first vector from the loop to avoid the duplicate embedding call.

♻️ Suggested optimization
+            first_dense_vector: list[float] | None = None
             points: list[PointStruct] = []
             for vals in node.values_list:
                 point_id, payload = self._extract_point_id_and_payload(vals)
                 dense_vector = dense_embedder.embed(vals["text"])
+                if first_dense_vector is None:
+                    first_dense_vector = dense_vector
                 sparse_obj = sparse_embedder.embed(vals["text"])
                 ...

             if not topology.exists:
-                first_dense = dense_embedder.embed(node.values_list[0]["text"])
                 self._create_collection_and_wait(
                     collection_name=node.collection,
                     vectors_config={
-                        dense_name: VectorParams(size=len(first_dense), distance=Distance.COSINE)
+                        dense_name: VectorParams(size=len(first_dense_vector), distance=Distance.COSINE)
                     },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/qql/executor.py` around lines 403 - 408, The code redundantly calls
dense_embedder.embed for node.values_list[0]["text"] when creating the
collection; instead, capture the first embedding produced in the loop that
processes node.values_list (the same loop that produces embeddings earlier) and
reuse that vector for VectorParams in _create_collection_and_wait rather than
re-embedding; update the logic around first_dense/dense_name and the loop that
generates embeddings (refer to dense_embedder.embed, first_dense,
_create_collection_and_wait, and node.values_list) so the first produced vector
is stored and passed into the collection creation step when topology.exists is
false.

557-560: 💤 Low value

Unused variable; redundant _resolve_topology call.

topology is assigned at line 557 but never used. Meanwhile, _build_vectors_config_diff at line 1336 calls _resolve_topology again for the same collection, causing duplicate Qdrant API calls. Consider either removing the unused variable or passing topology to _build_vectors_config_diff to reuse it.

♻️ Option: Remove unused variable
     def _execute_alter_collection(self, node: AlterCollectionStmt) -> ExecutionResult:
         if not self._client.collection_exists(node.collection):
             raise QQLRuntimeError(f"Collection '{node.collection}' does not exist")
-        topology = self._resolve_topology(node.collection)

         update_kwargs: dict[str, Any] = {"collection_name": node.collection}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/qql/executor.py` around lines 557 - 560, The local variable topology is
assigned but unused and causes a duplicate _resolve_topology call inside
_build_vectors_config_diff; modify _build_vectors_config_diff to accept a
pre-resolved topology (add a parameter like resolved_topology) and use that
instead of calling _resolve_topology internally, then update this call site to
pass the resolved topology (remove the unused local assignment if you choose to
resolve inline) or, alternatively, remove the initial topology assignment and
let _build_vectors_config_diff remain responsible for resolving; ensure all
other callers of _build_vectors_config_diff are updated to the new signature if
you opt to pass topology so you avoid the extra Qdrant API call and eliminate
the unused variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/scripts.md`:
- Around line 158-159: Update the example block in docs/scripts.md so its CREATE
and INSERT examples follow the canonical dump syntax: replace occurrences like
"CREATE COLLECTION medical_records HYBRID" with "CREATE COLLECTION <name> USING
HYBRID ..." and change INSERT examples from "] USING HYBRID" style to "INSERT
BULK ... USING HYBRID ..." (and similarly use "USING VECTOR '<vector_name>'" for
named-vector dense collections or plain "CREATE COLLECTION <name>" for unnamed
external collections). Locate the example text mentioning "medical_records" and
the INSERT block and make the CREATE/INSERT lines use the documented "USING
HYBRID" / "USING VECTOR '<name>'" form consistently.

---

Nitpick comments:
In `@src/qql/executor.py`:
- Around line 403-408: The code redundantly calls dense_embedder.embed for
node.values_list[0]["text"] when creating the collection; instead, capture the
first embedding produced in the loop that processes node.values_list (the same
loop that produces embeddings earlier) and reuse that vector for VectorParams in
_create_collection_and_wait rather than re-embedding; update the logic around
first_dense/dense_name and the loop that generates embeddings (refer to
dense_embedder.embed, first_dense, _create_collection_and_wait, and
node.values_list) so the first produced vector is stored and passed into the
collection creation step when topology.exists is false.
- Around line 557-560: The local variable topology is assigned but unused and
causes a duplicate _resolve_topology call inside _build_vectors_config_diff;
modify _build_vectors_config_diff to accept a pre-resolved topology (add a
parameter like resolved_topology) and use that instead of calling
_resolve_topology internally, then update this call site to pass the resolved
topology (remove the unused local assignment if you choose to resolve inline)
or, alternatively, remove the initial topology assignment and let
_build_vectors_config_diff remain responsible for resolving; ensure all other
callers of _build_vectors_config_diff are updated to the new signature if you
opt to pass topology so you avoid the extra Qdrant API call and eliminate the
unused variable.

In `@tests/test_executor.py`:
- Around line 1903-1917: Add a negative test that sets
SearchStmt(collection="col", query_text="q", limit=5, model=None,
dense_vector="missing_name") and asserts executor.execute(node) raises
QQLRuntimeError with a message about an unknown/invalid dense vector name; use
the existing mock_client.get_collection.return_value.config.params.vectors
(which defines "title" and "body") so the provided dense_vector "missing_name"
is not present and the executor path in SearchStmt (and any validation logic)
fails instead of silently falling back. Ensure the new test mirrors naming and
setup style of test_dense_search_requires_explicit_name_for_ambiguous_collection
and references SearchStmt, executor.execute, and mock_client.get_collection to
locate the behavior under test.
🪄 Autofix (Beta)

❌ Autofix failed (check again to retry)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: e8cbe910-5d8b-4347-935b-619980ee515d

📥 Commits

Reviewing files that changed from the base of the PR and between 7b7d434 and cf77e7d.

📒 Files selected for processing (15)
  • docs/collections.md
  • docs/insert.md
  • docs/programmatic.md
  • docs/reference.md
  • docs/scripts.md
  • docs/search.md
  • src/qql/ast_nodes.py
  • src/qql/cli.py
  • src/qql/config.py
  • src/qql/dumper.py
  • src/qql/executor.py
  • src/qql/parser.py
  • tests/test_dumper.py
  • tests/test_executor.py
  • tests/test_parser.py

Comment thread docs/scripts.md
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
src/qql/cli.py (1)

113-113: 💤 Low value

Clarify optional parameter notation in UPDATE help text.

The notation ['<dense_vector>'] (square brackets around a quoted parameter) is ambiguous—it could be misread as requiring literal brackets. Consider either removing the quotes inside the brackets to show [<dense_vector>], or documenting both forms explicitly as done in docs/collections.md lines 392-394.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/qql/cli.py` at line 113, The UPDATE help text currently uses the
ambiguous "['<dense_vector>']" notation; update the usage string in
src/qql/cli.py (the UPDATE help/usage declaration where "['<dense_vector>']"
appears) to use unquoted optional parameter notation like "[<dense_vector>]"
instead of including literal quotes/brackets, or alternatively list both
accepted forms (e.g., "<dense_vector>" and "['<dense_vector>']") in the same
help string to match docs/collections.md lines 392-394; ensure the changed
string preserves the surrounding format tokens "[yellow]UPDATE[/yellow] <name>
[yellow]SET VECTOR[/yellow] ..." and any tests or callers that parse this help
text are updated accordingly.
src/qql/executor.py (1)

557-560: 💤 Low value

Unused topology variable causes redundant get_collection call.

The topology resolved at line 557 is never used in this method. Meanwhile, _build_vectors_config_diff at line 1336 resolves topology again internally, resulting in two get_collection calls per ALTER.

Consider either removing the unused assignment or passing the topology to _build_vectors_config_diff to avoid the redundant network call.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/qql/executor.py` around lines 557 - 560, The local variable topology
assigned by calling _resolve_topology(node.collection) in the ALTER flow is
unused and causes a redundant get_collection call because
_build_vectors_config_diff(node.collection, node.config) resolves topology
again; either remove the unused topology assignment or modify
_build_vectors_config_diff to accept a pre-resolved topology (e.g., add a
parameter and use it instead of calling _resolve_topology/get_collection
internally) and update the call site in executor.py to pass the resolved
topology so only one get_collection is performed per ALTER.
tests/test_executor.py (1)

2924-2936: 💤 Low value

Mock for unnamed collection could be more realistic.

Using [None] works because _resolve_topology treats any non-dict, non-None value as an unnamed vector collection. However, Qdrant actually returns a VectorParams object for unnamed collections.

A more representative mock would be a MagicMock simulating VectorParams:

from unittest.mock import MagicMock
info.config.params.vectors = MagicMock()  # or VectorParams(size=3, distance=Distance.COSINE)

This would make the test more robust if _resolve_topology ever adds stricter type checking.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_executor.py` around lines 2924 - 2936, The test
test_update_vector_unnamed_collection_sends_plain_list should mock the
collection vector params more realistically: instead of setting
info.config.params.vectors = [None], set info.config.params.vectors to a
MagicMock (or a VectorParams-like object) so it resembles Qdrant's actual
VectorParams; update the test that constructs UpdateVectorStmt and calls
executor.execute to use this MagicMock return (via
mock_client.get_collection.return_value) so _resolve_topology sees a
VectorParams-like object for unnamed collections and the assertions on
mock_client.update_vectors (points[0].vector) remain valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/qql/cli.py`:
- Line 113: The UPDATE help text currently uses the ambiguous
"['<dense_vector>']" notation; update the usage string in src/qql/cli.py (the
UPDATE help/usage declaration where "['<dense_vector>']" appears) to use
unquoted optional parameter notation like "[<dense_vector>]" instead of
including literal quotes/brackets, or alternatively list both accepted forms
(e.g., "<dense_vector>" and "['<dense_vector>']") in the same help string to
match docs/collections.md lines 392-394; ensure the changed string preserves the
surrounding format tokens "[yellow]UPDATE[/yellow] <name> [yellow]SET
VECTOR[/yellow] ..." and any tests or callers that parse this help text are
updated accordingly.

In `@src/qql/executor.py`:
- Around line 557-560: The local variable topology assigned by calling
_resolve_topology(node.collection) in the ALTER flow is unused and causes a
redundant get_collection call because
_build_vectors_config_diff(node.collection, node.config) resolves topology
again; either remove the unused topology assignment or modify
_build_vectors_config_diff to accept a pre-resolved topology (e.g., add a
parameter and use it instead of calling _resolve_topology/get_collection
internally) and update the call site in executor.py to pass the resolved
topology so only one get_collection is performed per ALTER.

In `@tests/test_executor.py`:
- Around line 2924-2936: The test
test_update_vector_unnamed_collection_sends_plain_list should mock the
collection vector params more realistically: instead of setting
info.config.params.vectors = [None], set info.config.params.vectors to a
MagicMock (or a VectorParams-like object) so it resembles Qdrant's actual
VectorParams; update the test that constructs UpdateVectorStmt and calls
executor.execute to use this MagicMock return (via
mock_client.get_collection.return_value) so _resolve_topology sees a
VectorParams-like object for unnamed collections and the assertions on
mock_client.update_vectors (points[0].vector) remain valid.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 7e9cdf9c-c405-4f3c-9578-c54c548388bc

📥 Commits

Reviewing files that changed from the base of the PR and between 7b7d434 and cf77e7d.

📒 Files selected for processing (15)
  • docs/collections.md
  • docs/insert.md
  • docs/programmatic.md
  • docs/reference.md
  • docs/scripts.md
  • docs/search.md
  • src/qql/ast_nodes.py
  • src/qql/cli.py
  • src/qql/config.py
  • src/qql/dumper.py
  • src/qql/executor.py
  • src/qql/parser.py
  • tests/test_dumper.py
  • tests/test_executor.py
  • tests/test_parser.py

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

An unexpected error occurred while generating fixes: Not Found - https://docs.github.com/rest/git/refs#get-a-reference

@srimon12
Copy link
Copy Markdown
Collaborator Author

@coderabbitai check

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

I'll review the current state of the PR changes in detail.


🏁 Script executed:

#!/bin/bash
cat src/qql/ast_nodes.py
echo "---"
cat src/qql/parser.py
echo "---"
cat src/qql/config.py

Length of output: 70182


🏁 Script executed:

#!/bin/bash
cat src/qql/executor.py | head -300
echo "=== MIDDLE ==="
cat src/qql/executor.py | sed -n '300,600p'
echo "=== LATER ==="
cat src/qql/executor.py | sed -n '600,900p'

Length of output: 34980


🏁 Script executed:

#!/bin/bash
cat src/qql/executor.py | sed -n '900,1200p'
echo "=== END ==="
cat src/qql/executor.py | tail -200

Length of output: 20285


🏁 Script executed:

#!/bin/bash
# Find key methods in executor.py
rg -n "_build_dense_point_vector|_execute_update_vector|_execute_search_groups|_build_vectors_config_diff" src/qql/executor.py
echo "---"
# Get those method implementations
ast-grep --pattern 'def _build_dense_point_vector($$$) {
  $$$
}'
echo "---"
sed -n '/def _build_dense_point_vector/,/^    def /p' src/qql/executor.py | head -60
echo "---"
sed -n '/def _execute_update_vector/,/^    def /p' src/qql/executor.py | head -80
echo "---"
sed -n '/def _execute_search_groups/,/^    def /p' src/qql/executor.py | head -100
echo "---"
sed -n '/def _build_vectors_config_diff/,/^    def /p' src/qql/executor.py | head -60

Length of output: 6758


🏁 Script executed:

#!/bin/bash
cat tests/test_parser.py | grep -A 30 "dense_vector\|sparse_vector\|vector_name\|USING VECTOR\|USING HYBRID.*VECTOR"
echo "==="
cat tests/test_dumper.py | grep -A 10 "hybrid\|DENSE VECTOR\|SPARSE VECTOR"
echo "==="
# Check for any unresolved TODOs or FIXMEs
rg -n "TODO|FIXME|HACK|XXX|NOQA|type: ignore" src/qql/executor.py src/qql/parser.py src/qql/ast_nodes.py
echo "==="
# Check for potential NameError in insert_bulk dense path
sed -n '/def _execute_insert_bulk/,/def _execute_/p' src/qql/executor.py | grep -n "vector\|len(vec"

Length of output: 14738


@srimon12 — here's my full review of the current state of the changes.


Overall Assessment

The architecture is solid: the new CollectionTopology dataclass cleanly centralises vector-structure awareness, and the dense_using() / sparse_using() / dense_payload_name() helpers are well-designed and consistently used across all search and insert paths. The parser additions are correct and the AST changes are minimal and precise.

Below are the issues I found, ranked by severity.


🔴 Bugs / Correctness Issues

1. _execute_insert_bulkvector referenced after loop (potential NameError on empty list edge-case, semantic fragility)

In the dense-only bulk INSERT path, _ensure_collection is called after the for loop using the last value of vector assigned inside the loop:

# executor.py — dense-only _execute_insert_bulk
for vals in node.values_list:
    vector = embedder.embed(vals["text"])   # assigned in loop
    ...
    points.append(...)

self._ensure_collection(
    node.collection, len(vector), ...  # ← `vector` leaks out from loop scope
)

Although the empty-list guard at the top of the function prevents a true NameError, this is semantically fragile and confusing. The embedding used for the dimension check is non-deterministic (whichever ran last). Consider embedding the first item before the loop and using it for both the size check and the first point:

first_text = node.values_list[0]["text"]
first_vector = embedder.embed(first_text)
vector_size = len(first_vector)

points = []
for i, vals in enumerate(node.values_list):
    vec = first_vector if i == 0 else embedder.embed(vals["text"])
    ...

self._ensure_collection(node.collection, vector_size, topology, node.dense_vector)

2. _build_vectors_config_diff — no explicit_vector argument, fails silently on multi-named-dense collections

def _build_vectors_config_diff(self, topology, config):
    ...
    vector_name = topology.dense_config_key()   # ← no explicit_vector passed
    return {vector_name: VectorParamsDiff(on_disk=config.vectors.on_disk)}

dense_config_key() calls dense_using() with explicit=None. If a collection has multiple named dense vectors, dense_using() raises QQLRuntimeError("Collection has multiple dense vectors; specify one with USING VECTOR '<name>'"). ALTER COLLECTION … WITH VECTORS {on_disk: true} would crash on any multi-dense-vector collection. This method should either accept and forward an explicit_vector parameter or document the limitation.


🟡 Behavioural Changes Worth Documenting / Validating

3. CREATE COLLECTION name (no USING) now always creates a named dense vector called "dense"

# executor.py _execute_create — dense-only branch
dense_name = node.dense_vector or self._default_dense_vector_name()   # always "dense"
create_kwargs = {
    "vectors_config": {dense_name: VectorParams(...)}   # always named
}

Before this PR, simple CREATE COLLECTION articles likely created an unnamed dense vector (Qdrant's single-vector mode). Now it creates a collection with vector name "dense". This is a breaking change for any existing dump/restore workflows that produced unnamed-vector collections. Make sure this is intentional and documented (the docs/scripts.md change does mention this, but it's worth a changelog note too).


4. CollectionTopology.dense_using() returns None for unnamed-dense in the hybrid INSERT path — fallback may silently use the wrong name

# _execute_insert hybrid path
dense_name = node.dense_vector or self._default_dense_vector_name()
if topology.exists:
    dense_name = topology.dense_using(node.dense_vector) or dense_name  # None → fallback

dense_using() returns None for an unnamed single-dense vector. If someone tries a hybrid INSERT into a collection that has an unnamed dense vector paired with a sparse vector (unusual but theoretically reachable), the or dense_name fallback would silently use "dense" as the key and Qdrant would reject the point with a confusing error. A more explicit guard here would improve debuggability:

resolved = topology.dense_using(node.dense_vector)
if resolved is None and topology.is_hybrid:
    raise QQLRuntimeError(
        "Hybrid collections must use named dense vectors; collection appears misconfigured"
    )
dense_name = resolved or dense_name

🟠 Missing Test Coverage

5. No parser test for INSERT BULK … USING HYBRID DENSE VECTOR … SPARSE VECTOR …

_parse_insert_bulk_body has the same HYBRID sub-clause parsing as _parse_insert, but there is no corresponding test in tests/test_parser.py for the bulk variant with explicit vector names. Add:

def test_insert_bulk_hybrid_vector_names(self):
    node = parse(
        "INSERT BULK INTO COLLECTION col VALUES [{'text': 'hi'}] "
        "USING HYBRID DENSE VECTOR 'emb' SPARSE VECTOR 'lex'"
    )
    assert isinstance(node, InsertBulkStmt)
    assert node.hybrid is True
    assert node.dense_vector == "emb"
    assert node.sparse_vector == "lex"

6. No executor test for UPDATE SET VECTOR 'name' on named-vector collections

_execute_update_vector correctly calls topology.dense_payload_name(node.vector_name), but there's no executor-level test verifying that the right vector name is forwarded to update_vectors().


🟢 Minor / Nits

7. __init__.py exports DEFAULT_MODEL but not the new defaults

src/qql/__init__.py now exports DEFAULT_MODEL. For consistency, consider also exporting DEFAULT_DENSE_VECTOR_NAME and DEFAULT_SPARSE_VECTOR_NAME from config.py since callers may want to reference them.

8. Legacy CREATE COLLECTION name HYBRID path still accepted in parser

# parser.py _parse_create
if self._peek().kind == TokenKind.HYBRID:
    self._advance()
    hybrid = True          # ← dense_vector / sparse_vector remain None

This backward-compat path works fine, but it is silently unable to carry vector names. Worth a comment or, eventually, a deprecation note.

9. CollectionTopology.sparse_using() has no guard for topology.exists=False

If sparse_using() is ever called on a CollectionTopology(exists=False), it raises "Collection has no sparse vector" — which is technically correct but not very helpful. Currently all callers correctly avoid calling it on non-existent topologies, but a docstring note would help future contributors.


Let me know when you're ready for another look once the remaining work is done. 🐇

@srimon12
Copy link
Copy Markdown
Collaborator Author

@coderabbitai check now once again.. go through the previous comments and fixes we have applied.. check for any gaps still remaining..

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

Let me check the current state of the code against all previously raised issues.


🏁 Script executed:

#!/bin/bash
echo "=== _execute_insert_bulk (vector variable after loop) ==="
sed -n '/def _execute_insert_bulk/,/^    def [a-z]/p' src/qql/executor.py | head -80

echo ""
echo "=== _build_vectors_config_diff ==="
sed -n '/def _build_vectors_config_diff/,/^    def [a-z]/p' src/qql/executor.py | head -40

echo ""
echo "=== dense_using fallback in hybrid INSERT ==="
sed -n '/def _execute_insert\b/,/^    def [a-z]/p' src/qql/executor.py | head -80

echo ""
echo "=== CollectionTopology dense_using / sparse_using ==="
sed -n '/class CollectionTopology/,/^class [A-Z]/p' src/qql/executor.py | head -80

echo ""
echo "=== __init__.py exports ==="
cat src/qql/__init__.py

echo ""
echo "=== Legacy HYBRID path in parser ==="
sed -n '/def _parse_create/,/^    def [a-z]/p' src/qql/parser.py | grep -A5 -B2 "HYBRID"

echo ""
echo "=== test_parser.py — INSERT BULK hybrid vector names ==="
rg -n "insert_bulk.*hybrid\|hybrid.*bulk\|BULK.*DENSE\|DENSE.*BULK" tests/test_parser.py

echo ""
echo "=== test_executor.py — UPDATE SET VECTOR named ==="
rg -n "update_vector\|UPDATE.*VECTOR\|vector_name\|SET VECTOR" tests/test_executor.py | head -30

echo ""
echo "=== sparse_using docstring ==="
rg -n "sparse_using\|docstring\|\"\"\"" src/qql/executor.py | head -30

Length of output: 15341


🏁 Script executed:

#!/bin/bash
echo "=== Dense-only INSERT BULK tail (first_vector pattern) ==="
sed -n '/# ── Standard dense-only bulk INSERT/,/^    def [a-z]/p' src/qql/executor.py | head -60

echo ""
echo "=== test_parser.py — all INSERT BULK tests ==="
rg -n "insert_bulk\|INSERT BULK\|InsertBulkStmt" tests/test_parser.py | head -30

echo ""
echo "=== test_executor.py — UpdateVector tests ==="
rg -n "UpdateVector\|update_vector\|_execute_update" tests/test_executor.py | head -20

echo ""
echo "=== sparse_using has docstring? ==="
sed -n '/def sparse_using/,/def [a-z]/p' src/qql/executor.py | head -20

echo ""
echo "=== Legacy bare HYBRID parse path (without USING) ==="
sed -n '/def _parse_create/,/^    def [a-z]/p' src/qql/parser.py | head -60

echo ""
echo "=== test_parser.py: are there any hybrid BULK tests? ==="
grep -n -A5 "BULK.*HYBRID\|hybrid.*bulk\|DENSE VECTOR\|dense_vector.*bulk" tests/test_parser.py | head -40

Length of output: 7345


🏁 Script executed:

#!/bin/bash
echo "=== All UpdateVector / update_vector references in tests ==="
rg -n "UpdateVector\|update_vector\|_execute_update\|SET VECTOR\|TestUpdate" tests/test_executor.py tests/test_parser.py 2>/dev/null | head -40

echo ""
echo "=== Check for vector_name in test_parser.py ==="
rg -n "vector_name\|SET VECTOR" tests/test_parser.py | head -20

echo ""
echo "=== Check test_executor.py for any update-related class or function ==="
rg -n "def test.*update\|class Test.*update\|class Test.*Update" tests/test_executor.py | head -20

Length of output: 183


🏁 Script executed:

#!/bin/bash
echo "=== Broad search for 'update' across test files ==="
rg -in "update" tests/test_executor.py | head -30

echo ""
echo "=== Broad search for 'update' across test_parser.py ==="
rg -in "update" tests/test_parser.py | head -30

echo ""
echo "=== All test class names in test_executor.py ==="
rg -n "^class Test" tests/test_executor.py | head -40

echo ""
echo "=== All test class names in test_parser.py ==="
rg -n "^class Test\|def test_" tests/test_parser.py | grep -i "update\|vector_name" | head -20

echo ""
echo "=== All test function names in test_parser.py (last 100 lines) ==="
tail -120 tests/test_parser.py

Length of output: 10303


🏁 Script executed:

#!/bin/bash
echo "=== TestUpdateVector executor tests (full) ==="
sed -n '/^class TestUpdateVector/,/^class Test/p' tests/test_executor.py | head -80

echo ""
echo "=== TestUpdateVectorVectorShape ==="
sed -n '/^class TestUpdateVectorVectorShape/,/^class Test/p' tests/test_executor.py | head -60

echo ""
echo "=== _execute_update_vector implementation ==="
sed -n '/def _execute_update_vector/,/^    def [a-z]/p' src/qql/executor.py | head -50

Length of output: 8323


@srimon12 — great progress! Here's the gap analysis against all my previous findings.


✅ All Red and Yellow Issues — Resolved

# Issue Status
1 _execute_insert_bulk — leaky vector after loop ✅ Fixed (`first_vector: list[float]
2 _build_vectors_config_diff — silent crash on multi-dense collections ✅ Fixed (catches QQLRuntimeError, re-raises with clear ALTER COLLECTION message)
3 dense_using() silent None fallback in hybrid INSERT ✅ Fixed (raises explicit QQLRuntimeError("Hybrid collections must use named dense vectors"))
4 __init__.py missing new default exports ✅ Fixed (DEFAULT_DENSE_VECTOR_NAME and DEFAULT_SPARSE_VECTOR_NAME both imported and in __all__)
5 Missing parser test: INSERT BULK … USING HYBRID DENSE VECTOR … SPARSE VECTOR … ✅ Fixed (test confirmed in test_parser.py)
6 Parser test for UPDATE SET VECTOR 'name' with vector_name ✅ Fixed (test_update_vector_with_vector_name added)

🟠 One Remaining Gap — Executor test for explicit vector_name in UPDATE

TestUpdateVectorVectorShape.test_update_vector_named_collection_sends_dict covers topology-auto-resolution (passes no vector_name), but no test exercises the explicit vector_name path through _execute_update_vector:

# _execute_update_vector
vector_name = topology.dense_payload_name(node.vector_name)  # ← explicit path untested

A test like this is still missing:

def test_update_vector_explicit_vector_name_sends_correct_key(self, executor, mock_client):
    from qdrant_client.models import Distance, VectorParams
    from qql.ast_nodes import UpdateVectorStmt

    mock_client.collection_exists.return_value = True
    info = mock_client.get_collection.return_value
    info.config.params.vectors = {
        "emb": VectorParams(size=2, distance=Distance.COSINE),
        "body": VectorParams(size=2, distance=Distance.COSINE),
    }
    info.config.params.sparse_vectors = None

    node = UpdateVectorStmt(
        collection="articles", point_id="id-1", vector=(0.5, 0.6), vector_name="emb"
    )
    executor.execute(node)
    pv = mock_client.update_vectors.call_args.kwargs["points"][0]
    assert isinstance(pv.vector, dict)
    assert "emb" in pv.vector
    assert pv.vector["emb"] == [0.5, 0.6]

This test would verify that when vector_name="emb" is passed, dense_payload_name("emb") resolves to "emb" and the correct key is used in the PointVectors payload — particularly important for multi-dense-vector collections where an incorrect resolution would silently write to the wrong vector.


🟢 Minor Nits (unchanged from before)

  • Legacy bare HYBRID parser path in _parse_create still accepts CREATE COLLECTION name HYBRID (without USING) without any comment — worth a # backward-compat: deprecated note.
  • sparse_using() docstring — calling on a CollectionTopology(exists=False) produces a confusing error; a short docstring note would help future contributors.

Overall the PR is in very good shape — only the explicit vector_name executor test is left to add before this is production-ready. 🐇

Copy link
Copy Markdown
Owner

@pavanjava pavanjava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good

@pavanjava pavanjava merged commit 05be76d into pavanjava:main May 19, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants