diff --git a/README.md b/README.md index 76f416c..d530f7c 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ [![MIT License](https://img.shields.io/badge/license-MIT-green)](LICENSE) [![Tests](https://img.shields.io/badge/tests-375%20passing-brightgreen)](tests/) -Write `INSERT`, `SEARCH`, `RECOMMEND`, `DELETE`, and `CREATE COLLECTION` statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, binary, product), SQL-style `WHERE` filters, script execution, and collection dump/restore. +Write `INSERT`, `SEARCH`, `RECOMMEND`, `DELETE`, and `CREATE COLLECTION` statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, turbo, binary, product), SQL-style `WHERE` filters, script execution, and collection dump/restore. ``` qql> INSERT INTO COLLECTION notes VALUES {'text': 'Qdrant is a vector database', 'author': 'alice', 'year': 2024} @@ -84,7 +84,7 @@ Full documentation lives in the [`docs/`](docs/) folder and at **[pavanjava.gith | [INSERT / INSERT BULK](docs/insert.md) | Adding documents, batch inserts, payload types | | [SEARCH / RECOMMEND / Hybrid / RERANK](docs/search.md) | Semantic search, hybrid, reranking, recommendations | | [WHERE Filters](docs/filters.md) | Full SQL-style filter operators | -| [Collections & Quantization](docs/collections.md) | CREATE, DROP, QUANTIZE (scalar/binary/product), CREATE INDEX | +| [Collections & Quantization](docs/collections.md) | CREATE, DROP, QUANTIZE (scalar/turbo/binary/product), CREATE INDEX | | [Scripts: EXECUTE / DUMP](docs/scripts.md) | Script files, collection backup/restore | | [Programmatic Usage](docs/programmatic.md) | Use QQL as a Python library | | [Reference: Models / Config / Errors](docs/reference.md) | Embedding models, config file, error reference | @@ -111,6 +111,9 @@ RECOMMEND FROM articles POSITIVE IDS (1001, 1002) LIMIT 5 CREATE COLLECTION articles CREATE 
COLLECTION articles HYBRID CREATE COLLECTION articles QUANTIZE SCALAR +CREATE COLLECTION articles QUANTIZE TURBO +CREATE COLLECTION articles QUANTIZE TURBO BITS 2 +CREATE COLLECTION articles QUANTIZE TURBO BITS 1.5 ALWAYS RAM CREATE INDEX ON COLLECTION articles FOR year TYPE integer SHOW COLLECTIONS DROP COLLECTION articles diff --git a/docs/collections.md b/docs/collections.md index 8082bc1..784cf6d 100644 --- a/docs/collections.md +++ b/docs/collections.md @@ -67,27 +67,38 @@ When `USING MODEL` is omitted, the collection uses the **default embedding model ## Quantization — QUANTIZE clause -Quantization reduces the memory footprint of vector collections and speeds up search at the cost of a small, controllable accuracy loss. QQL supports all three Qdrant quantization strategies via an optional `QUANTIZE` clause appended to `CREATE COLLECTION`. +Quantization reduces the memory footprint of vector collections and speeds up search at the cost of a small, controllable accuracy loss. QQL supports all four Qdrant quantization strategies via an optional `QUANTIZE` clause appended to `CREATE COLLECTION`. -**Three strategies:** +**Four strategies:** -| Type | Compression | Accuracy Loss | Best For | +| Type | Compression | Accuracy | Best For | |---|---|---|---| -| `SCALAR` | 4× (float32 → int8) | < 1% | Most collections — best balance | -| `BINARY` | 32× (float32 → 1-bit) | Higher | High-dimensional vectors (768+), speed priority | +| `SCALAR` | 4× (float32 → int8) | < 1% loss | Most collections — best balance | +| `TURBO` | 8–32× (4-bit to 1-bit) | Low–medium | Better recall than BINARY at same storage budget | +| `BINARY` | 32× (float32 → 1-bit) | Higher loss | Speed priority; centered distributions only | | `PRODUCT` | 4× (configurable) | Variable | Memory-constrained deployments | **Full syntax:** ``` CREATE COLLECTION ... QUANTIZE SCALAR [QUANTILE <0.0–1.0>] [ALWAYS RAM] +CREATE COLLECTION ... QUANTIZE TURBO [BITS <1|1.5|2|4>] [ALWAYS RAM] CREATE COLLECTION ... 
QUANTIZE BINARY [ALWAYS RAM] CREATE COLLECTION ... QUANTIZE PRODUCT [ALWAYS RAM] ``` -- **`QUANTILE `** — (scalar only) calibration quantile for the INT8 conversion; defaults to Qdrant's built-in default (0.99) when omitted. -- **`ALWAYS RAM`** — keep the **quantized** vectors in RAM at all times, regardless of the collection's `on_disk` setting. Improves search throughput at the cost of higher RAM usage for the compressed index. The original full-precision vectors are stored and managed independently of this flag. Supported by all three quantization types. +- **`QUANTILE `** — (SCALAR only) calibration quantile for the INT8 conversion; defaults to Qdrant's built-in default (0.99) when omitted. +- **`BITS `** — (TURBO only) bit depth passed to the Qdrant SDK: + - `4` — 4-bit (default when `BITS` is omitted; server applies its own default) + - `2` — 2-bit + - `1.5` — 1.5-bit + - `1` — 1-bit + > Compression ratios (8×, 16×, 24×, 32×) and recall characteristics are + > Qdrant server-side behaviors. QQL maps the `BITS` value to the SDK model and + > passes it to Qdrant; actual results depend on your Qdrant server version. +- **`ALWAYS RAM`** — keep the **quantized** vectors in RAM at all times, regardless of the collection's `on_disk` setting. Improves search throughput at the cost of higher RAM usage for the compressed index. The original full-precision vectors are stored and managed independently of this flag. Supported by all four quantization types. - **`QUANTIZE`** always appears **after** all other clauses (`HYBRID`, `USING MODEL`, etc.). - For `PRODUCT`, the compression ratio is fixed at **4×** in this version. +- For `TURBO`, Cosine, Dot, and Euclidean distance are supported by the Qdrant server when TurboQuant is enabled. - When used with `HYBRID` collections, quantization applies only to the **dense** vector. 
**Examples:** @@ -102,6 +113,26 @@ Scalar with explicit calibration and quantized vectors pinned to RAM: CREATE COLLECTION research_papers QUANTIZE SCALAR QUANTILE 0.95 ALWAYS RAM ``` +TurboQuant — default 4-bit (8× compression, good recall): +```sql +CREATE COLLECTION research_papers QUANTIZE TURBO +``` + +TurboQuant — 2-bit (16× compression): +```sql +CREATE COLLECTION research_papers QUANTIZE TURBO BITS 2 +``` + +TurboQuant — 1.5-bit (24× compression) with quantized vectors pinned to RAM: +```sql +CREATE COLLECTION research_papers QUANTIZE TURBO BITS 1.5 ALWAYS RAM +``` + +TurboQuant — 1-bit (32× compression, same ratio as BINARY but better recall): +```sql +CREATE COLLECTION research_papers QUANTIZE TURBO BITS 1 +``` + Binary quantization for large high-dimensional embeddings: ```sql CREATE COLLECTION research_papers QUANTIZE BINARY @@ -115,22 +146,29 @@ CREATE COLLECTION research_papers QUANTIZE PRODUCT ALWAYS RAM Combined with hybrid collection: ```sql CREATE COLLECTION research_papers HYBRID QUANTIZE SCALAR +CREATE COLLECTION research_papers HYBRID QUANTIZE TURBO BITS 2 ``` Combined with a pinned model: ```sql CREATE COLLECTION research_papers USING MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE SCALAR QUANTILE 0.99 +CREATE COLLECTION research_papers USING MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE TURBO BITS 2 +``` + +Combined with hybrid + dense model: +```sql +CREATE COLLECTION research_papers USING HYBRID DENSE MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE TURBO ``` **Valid combinations:** -| Base form | + QUANTIZE SCALAR | + QUANTIZE BINARY | + QUANTIZE PRODUCT | -|---|---|---|---| -| `CREATE COLLECTION name` | ✓ | ✓ | ✓ | -| `... HYBRID` | ✓ | ✓ | ✓ | -| `... USING MODEL 'x'` | ✓ | ✓ | ✓ | -| `... USING HYBRID` | ✓ | ✓ | ✓ | -| `... USING HYBRID DENSE MODEL 'x'` | ✓ | ✓ | ✓ | +| Base form | + SCALAR | + TURBO | + BINARY | + PRODUCT | +|---|---|---|---|---| +| `CREATE COLLECTION name` | ✓ | ✓ | ✓ | ✓ | +| `... HYBRID` | ✓ | ✓ | ✓ | ✓ | +| `... 
USING MODEL 'x'` | ✓ | ✓ | ✓ | ✓ | +| `... USING HYBRID` | ✓ | ✓ | ✓ | ✓ | +| `... USING HYBRID DENSE MODEL 'x'` | ✓ | ✓ | ✓ | ✓ | > INSERT and SEARCH on quantized collections work exactly the same as on non-quantized ones — no changes to INSERT or SEARCH syntax are needed. diff --git a/pyproject.toml b/pyproject.toml index 9aec66b..fcd088b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,7 +1,7 @@ [project] name = "qql-cli" -version = "2.0.0" -description = "QQL is a SQL-like query language and CLI for Qdrant vector database. Write INSERT, SEARCH, RECOMMEND, DELETE, and CREATE COLLECTION statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, binary, product), WHERE clause filters, script execution, and collection dump/restore." +version = "2.1.0" +description = "QQL is a SQL-like query language and CLI for Qdrant vector database. Write INSERT, SEARCH, RECOMMEND, DELETE, and CREATE COLLECTION statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, turbo, binary, product), WHERE clause filters, script execution, and collection dump/restore." 
readme = "README.md" license = { file = "LICENSE" } requires-python = ">=3.12" @@ -37,7 +37,7 @@ classifiers = [ "Topic :: Text Processing :: Indexing", ] dependencies = [ - "qdrant-client[fastembed]>=1.13.0", + "qdrant-client[fastembed]>=1.18.0", "click>=8.1.0", "rich>=13.0.0", "prompt_toolkit>=3.0.0", diff --git a/src/qql/ast_nodes.py b/src/qql/ast_nodes.py index 5aa0562..b9f8b50 100644 --- a/src/qql/ast_nodes.py +++ b/src/qql/ast_nodes.py @@ -9,14 +9,16 @@ class QuantizationType(Enum): SCALAR = "scalar" BINARY = "binary" PRODUCT = "product" + TURBO = "turbo" @dataclass(frozen=True) class QuantizationConfig: """Quantization settings parsed from a QUANTIZE clause.""" type: QuantizationType - quantile: float | None = None # SCALAR only; None → Qdrant default (0.99) - always_ram: bool = False # all types; default False + quantile: float | None = None # SCALAR only; None → Qdrant default (0.99) + always_ram: bool = False # all types; default False + turbo_bits: float | None = None # TURBO only; None → BITS omitted, Qdrant server applies its default @dataclass(frozen=True) diff --git a/src/qql/executor.py b/src/qql/executor.py index f43b2d8..452eb57 100644 --- a/src/qql/executor.py +++ b/src/qql/executor.py @@ -41,6 +41,9 @@ ScalarQuantization, ScalarQuantizationConfig, ScalarType, + TurboQuantBitSize, + TurboQuantization, + TurboQuantQuantizationConfig, SearchParams, SparseVector, SparseVectorParams, @@ -846,7 +849,7 @@ def _wrap_as_filter(self, qdrant_expr: Any) -> Filter: def _build_quantization_config( self, qc: QuantizationConfig - ) -> ScalarQuantization | BinaryQuantization | ProductQuantization: + ) -> ScalarQuantization | BinaryQuantization | ProductQuantization | TurboQuantization: """Convert a parsed QuantizationConfig to a Qdrant SDK quantization object.""" if qc.type == QuantizationType.SCALAR: return ScalarQuantization( @@ -867,6 +870,28 @@ def _build_quantization_config( always_ram=qc.always_ram, ) ) + if qc.type == QuantizationType.TURBO: + _BITS_MAP: dict[float, 
TurboQuantBitSize] = { + 4.0: TurboQuantBitSize.BITS4, + 2.0: TurboQuantBitSize.BITS2, + 1.5: TurboQuantBitSize.BITS1_5, + 1.0: TurboQuantBitSize.BITS1, + } + if qc.turbo_bits is None: + bits_enum = None # user omitted BITS → preserve None, server applies default + elif qc.turbo_bits in _BITS_MAP: + bits_enum = _BITS_MAP[qc.turbo_bits] + else: + raise QQLRuntimeError( + f"Unsupported TURBO bit depth: {qc.turbo_bits}. " + f"Valid values: 1, 1.5, 2, 4" + ) + return TurboQuantization( + turbo=TurboQuantQuantizationConfig( + bits=bits_enum, + always_ram=qc.always_ram, + ) + ) raise QQLRuntimeError(f"Unknown quantization type: {qc.type}") def _collection_is_hybrid(self, name: str) -> bool: diff --git a/src/qql/lexer.py b/src/qql/lexer.py index 56ed1c7..ca8437e 100644 --- a/src/qql/lexer.py +++ b/src/qql/lexer.py @@ -27,6 +27,8 @@ class TokenKind(Enum): QUANTILE = auto() ALWAYS = auto() RAM = auto() + TURBO = auto() + BITS = auto() CREATE = auto() INDEX = auto() ON = auto() @@ -113,6 +115,8 @@ class TokenKind(Enum): "QUANTILE": TokenKind.QUANTILE, "ALWAYS": TokenKind.ALWAYS, "RAM": TokenKind.RAM, + "TURBO": TokenKind.TURBO, + "BITS": TokenKind.BITS, "CREATE": TokenKind.CREATE, "INDEX": TokenKind.INDEX, "ON": TokenKind.ON, diff --git a/src/qql/parser.py b/src/qql/parser.py index 42f9fa2..ef5e6fc 100644 --- a/src/qql/parser.py +++ b/src/qql/parser.py @@ -248,8 +248,32 @@ def _parse_quantize_clause(self) -> QuantizationConfig: always_ram = True return QuantizationConfig(type=QuantizationType.PRODUCT, always_ram=always_ram) + if tok.kind == TokenKind.TURBO: + self._advance() + turbo_bits: float | None = None + always_ram = False + if self._peek().kind == TokenKind.BITS: + self._advance() + bits_tok = self._peek() + raw = float(self._parse_number()) + if raw not in (1.0, 1.5, 2.0, 4.0): + raise QQLSyntaxError( + f"BITS must be one of 1, 1.5, 2, or 4 for TURBO quantization, got {raw}", + bits_tok.pos, + ) + turbo_bits = raw + if self._peek().kind == TokenKind.ALWAYS: + 
self._advance() + self._expect(TokenKind.RAM) + always_ram = True + return QuantizationConfig( + type=QuantizationType.TURBO, + turbo_bits=turbo_bits, + always_ram=always_ram, + ) + raise QQLSyntaxError( - f"Expected SCALAR, BINARY, or PRODUCT after QUANTIZE, got '{tok.value}'", + f"Expected SCALAR, BINARY, PRODUCT, or TURBO after QUANTIZE, got '{tok.value}'", tok.pos, ) diff --git a/tests/test_executor.py b/tests/test_executor.py index 11100d2..d5408d8 100644 --- a/tests/test_executor.py +++ b/tests/test_executor.py @@ -1640,3 +1640,110 @@ def test_result_message_no_quantization_suffix_when_absent(self, executor, mock_ node = CreateCollectionStmt(collection="articles") result = executor.execute(node) assert "quantization" not in result.message + + +class TestTurboQuantCreate: + """Executor tests for QUANTIZE TURBO — verifies correct SDK objects are built.""" + + @pytest.fixture + def executor(self, cfg, mock_client): + return Executor(mock_client, cfg) + + # ── TurboQuantization object is produced ────────────────────────────── + + def test_turbo_passes_turbo_quantization(self, executor, mock_client): + from qdrant_client.models import TurboQuantization + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert isinstance(kw.get("quantization_config"), TurboQuantization) + + def test_turbo_default_bits_is_none(self, executor, mock_client): + """When BITS is omitted, bits must be None — preserving omission so the + SDK/server applies its own default rather than QQL forcing BITS4.""" + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert kw["quantization_config"].turbo.bits is None + + def test_turbo_bits2(self, executor, mock_client): + from 
qdrant_client.models import TurboQuantBitSize + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=2.0), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert kw["quantization_config"].turbo.bits == TurboQuantBitSize.BITS2 + + def test_turbo_bits1_5(self, executor, mock_client): + from qdrant_client.models import TurboQuantBitSize + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=1.5), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert kw["quantization_config"].turbo.bits == TurboQuantBitSize.BITS1_5 + + def test_turbo_bits1(self, executor, mock_client): + from qdrant_client.models import TurboQuantBitSize + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=1.0), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert kw["quantization_config"].turbo.bits == TurboQuantBitSize.BITS1 + + def test_turbo_always_ram_true(self, executor, mock_client): + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO, always_ram=True), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert kw["quantization_config"].turbo.always_ram is True + + def test_turbo_always_ram_false_by_default(self, executor, mock_client): + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert kw["quantization_config"].turbo.always_ram is False + + def test_turbo_hybrid_collection_has_both_configs(self, executor, mock_client): + from qdrant_client.models import TurboQuantization + node = CreateCollectionStmt( + 
collection="articles", + hybrid=True, + quantization=QuantizationConfig(type=QuantizationType.TURBO), + ) + executor.execute(node) + kw = mock_client.create_collection.call_args.kwargs + assert isinstance(kw.get("quantization_config"), TurboQuantization) + assert "sparse_vectors_config" in kw + + def test_turbo_result_message_includes_turbo(self, executor, mock_client): + node = CreateCollectionStmt( + collection="articles", + quantization=QuantizationConfig(type=QuantizationType.TURBO), + ) + result = executor.execute(node) + assert "turbo" in result.message + + def test_turbo_invalid_bits_at_executor_raises(self, executor, mock_client): + """An unexpected turbo_bits value that bypasses parser validation must + raise QQLRuntimeError explicitly instead of silently coercing to BITS4.""" + from qql.exceptions import QQLRuntimeError as QQLErr + qc = QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=3.0) + with pytest.raises(QQLErr, match="Unsupported TURBO bit depth"): + executor._build_quantization_config(qc) diff --git a/tests/test_parser.py b/tests/test_parser.py index e0e07d5..e229bf3 100644 --- a/tests/test_parser.py +++ b/tests/test_parser.py @@ -1031,3 +1031,83 @@ def test_scalar_quantile_above_one_raises(self): def test_scalar_quantile_integer_above_one_raises(self): with pytest.raises(QQLSyntaxError): parse("CREATE COLLECTION articles QUANTIZE SCALAR QUANTILE 2") + + +class TestTurboQuantCreate: + """Parser tests for QUANTIZE TURBO [BITS n] [ALWAYS RAM].""" + + # ── Default / no options ────────────────────────────────────────────── + + def test_turbo_no_options(self): + node = parse("CREATE COLLECTION articles QUANTIZE TURBO") + assert node.quantization is not None + assert node.quantization.type == QuantizationType.TURBO + assert node.quantization.turbo_bits is None + assert node.quantization.always_ram is False + + # ── BITS variants ───────────────────────────────────────────────────── + + def test_turbo_bits4(self): + node = parse("CREATE 
COLLECTION articles QUANTIZE TURBO BITS 4") + assert node.quantization.type == QuantizationType.TURBO + assert node.quantization.turbo_bits == 4.0 + + def test_turbo_bits2(self): + node = parse("CREATE COLLECTION articles QUANTIZE TURBO BITS 2") + assert node.quantization.turbo_bits == 2.0 + + def test_turbo_bits1_5(self): + node = parse("CREATE COLLECTION articles QUANTIZE TURBO BITS 1.5") + assert node.quantization.turbo_bits == 1.5 + + def test_turbo_bits1(self): + node = parse("CREATE COLLECTION articles QUANTIZE TURBO BITS 1") + assert node.quantization.turbo_bits == 1.0 + + # ── ALWAYS RAM ──────────────────────────────────────────────────────── + + def test_turbo_always_ram_no_bits(self): + node = parse("CREATE COLLECTION articles QUANTIZE TURBO ALWAYS RAM") + assert node.quantization.type == QuantizationType.TURBO + assert node.quantization.always_ram is True + assert node.quantization.turbo_bits is None + + def test_turbo_bits_and_always_ram(self): + node = parse("CREATE COLLECTION articles QUANTIZE TURBO BITS 2 ALWAYS RAM") + assert node.quantization.turbo_bits == 2.0 + assert node.quantization.always_ram is True + + # ── Composed with other clauses ─────────────────────────────────────── + + def test_turbo_with_hybrid_shorthand(self): + node = parse("CREATE COLLECTION articles HYBRID QUANTIZE TURBO") + assert node.hybrid is True + assert node.quantization.type == QuantizationType.TURBO + + def test_turbo_with_using_hybrid(self): + node = parse("CREATE COLLECTION articles USING HYBRID QUANTIZE TURBO BITS 2") + assert node.hybrid is True + assert node.quantization.turbo_bits == 2.0 + + def test_turbo_with_model(self): + node = parse("CREATE COLLECTION articles USING MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE TURBO BITS 1.5") + assert node.model == "BAAI/bge-base-en-v1.5" + assert node.quantization.type == QuantizationType.TURBO + assert node.quantization.turbo_bits == 1.5 + + def test_turbo_with_hybrid_dense_model(self): + node = parse("CREATE COLLECTION 
articles USING HYBRID DENSE MODEL 'x' QUANTIZE TURBO BITS 1 ALWAYS RAM") + assert node.hybrid is True + assert node.model == "x" + assert node.quantization.turbo_bits == 1.0 + assert node.quantization.always_ram is True + + # ── Error cases ─────────────────────────────────────────────────────── + + def test_turbo_invalid_bits_raises(self): + with pytest.raises(QQLSyntaxError): + parse("CREATE COLLECTION articles QUANTIZE TURBO BITS 3") + + def test_turbo_invalid_bits_float_raises(self): + with pytest.raises(QQLSyntaxError): + parse("CREATE COLLECTION articles QUANTIZE TURBO BITS 0.5")
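Reviewer note: the TURBO branch in `_build_quantization_config` hinges on one rule, namely that an omitted `BITS` stays `None` all the way to the SDK while any explicit value must be one of 1, 1.5, 2, or 4. The sketch below isolates that rule so it can be read without the Qdrant client installed. It is an illustration only: `BitSize` is a hypothetical stand-in for the SDK's `TurboQuantBitSize` (its member values are invented here), `map_turbo_bits` is a made-up helper name, and `ValueError` stands in for `QQLRuntimeError`.

```python
from enum import Enum


class BitSize(Enum):
    """Hypothetical stand-in for TurboQuantBitSize; the values are assumptions."""
    BITS4 = "bits4"
    BITS2 = "bits2"
    BITS1_5 = "bits1_5"
    BITS1 = "bits1"


# Mirrors the _BITS_MAP shape used in the diff: parsed float -> enum member.
_BITS_MAP = {
    4.0: BitSize.BITS4,
    2.0: BitSize.BITS2,
    1.5: BitSize.BITS1_5,
    1.0: BitSize.BITS1,
}


def map_turbo_bits(turbo_bits):
    """Map a parsed BITS value to an enum member, preserving omission.

    None is returned unchanged so the server can apply its own default;
    any value outside the map is rejected instead of being coerced.
    """
    if turbo_bits is None:
        return None
    if turbo_bits in _BITS_MAP:
        return _BITS_MAP[turbo_bits]
    # ValueError stands in for QQLRuntimeError in this self-contained sketch.
    raise ValueError(
        f"Unsupported TURBO bit depth: {turbo_bits}. Valid values: 1, 1.5, 2, 4"
    )


if __name__ == "__main__":
    for value in (None, 4, 2, 1.5, 1):
        print(value, "->", map_turbo_bits(value))
```

Because Python hashes `2` and `2.0` identically, the lookup accepts either an int or a float, which is why the parser can normalise with `float(...)` without a second table. Keeping the rejection in the executor as well as the parser matches the diff's intent that a hand-built `QuantizationConfig` cannot slip through with an unexpected bit depth.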