Merged
7 changes: 5 additions & 2 deletions README.md
Collaborator

@srimon12 srimon12 May 12, 2026

Line 10: The README intro still lists quantization support as scalar, binary, product, but this PR adds TURBO as a fourth option. The top-level feature summary should be updated so that the repo landing page and package description reflect the actual supported set.
Line 87: This row still describes quantization as scalar/binary/product. Since TURBO is now supported and documented elsewhere in the PR, it should be updated to avoid inconsistency between the README summary and the actual syntax/examples below.

@@ -7,7 +7,7 @@
[![MIT License](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-375%20passing-brightgreen)](tests/)

Write `INSERT`, `SEARCH`, `RECOMMEND`, `DELETE`, and `CREATE COLLECTION` statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, binary, product), SQL-style `WHERE` filters, script execution, and collection dump/restore.
Write `INSERT`, `SEARCH`, `RECOMMEND`, `DELETE`, and `CREATE COLLECTION` statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, turbo, binary, product), SQL-style `WHERE` filters, script execution, and collection dump/restore.

```
qql> INSERT INTO COLLECTION notes VALUES {'text': 'Qdrant is a vector database', 'author': 'alice', 'year': 2024}
@@ -84,7 +84,7 @@ Full documentation lives in the [`docs/`](docs/) folder and at **[pavanjava.gith
| [INSERT / INSERT BULK](docs/insert.md) | Adding documents, batch inserts, payload types |
| [SEARCH / RECOMMEND / Hybrid / RERANK](docs/search.md) | Semantic search, hybrid, reranking, recommendations |
| [WHERE Filters](docs/filters.md) | Full SQL-style filter operators |
| [Collections & Quantization](docs/collections.md) | CREATE, DROP, QUANTIZE (scalar/binary/product), CREATE INDEX |
| [Collections & Quantization](docs/collections.md) | CREATE, DROP, QUANTIZE (scalar/turbo/binary/product), CREATE INDEX |
| [Scripts: EXECUTE / DUMP](docs/scripts.md) | Script files, collection backup/restore |
| [Programmatic Usage](docs/programmatic.md) | Use QQL as a Python library |
| [Reference: Models / Config / Errors](docs/reference.md) | Embedding models, config file, error reference |
@@ -111,6 +111,9 @@ RECOMMEND FROM articles POSITIVE IDS (1001, 1002) LIMIT 5
CREATE COLLECTION articles
CREATE COLLECTION articles HYBRID
CREATE COLLECTION articles QUANTIZE SCALAR
CREATE COLLECTION articles QUANTIZE TURBO
CREATE COLLECTION articles QUANTIZE TURBO BITS 2
CREATE COLLECTION articles QUANTIZE TURBO BITS 1.5 ALWAYS RAM
CREATE INDEX ON COLLECTION articles FOR year TYPE integer
SHOW COLLECTIONS
DROP COLLECTION articles
66 changes: 52 additions & 14 deletions docs/collections.md
@@ -67,27 +67,38 @@ When `USING MODEL` is omitted, the collection uses the **default embedding model

## Quantization — QUANTIZE clause

Quantization reduces the memory footprint of vector collections and speeds up search at the cost of a small, controllable accuracy loss. QQL supports all three Qdrant quantization strategies via an optional `QUANTIZE` clause appended to `CREATE COLLECTION`.
Quantization reduces the memory footprint of vector collections and speeds up search at the cost of a small, controllable accuracy loss. QQL supports all four Qdrant quantization strategies via an optional `QUANTIZE` clause appended to `CREATE COLLECTION`.

**Three strategies:**
**Four strategies:**

| Type | Compression | Accuracy Loss | Best For |
| Type | Compression | Accuracy | Best For |
|---|---|---|---|
| `SCALAR` | 4× (float32 → int8) | < 1% | Most collections — best balance |
| `BINARY` | 32× (float32 → 1-bit) | Higher | High-dimensional vectors (768+), speed priority |
| `SCALAR` | 4× (float32 → int8) | < 1% loss | Most collections — best balance |
| `TURBO` | 8–32× (4-bit to 1-bit) | Low–medium | Better recall than BINARY at same storage budget |
| `BINARY` | 32× (float32 → 1-bit) | Higher loss | Speed priority; centered distributions only |
| `PRODUCT` | 4× (fixed in this version) | Variable | Memory-constrained deployments |

**Full syntax:**
```
CREATE COLLECTION <name> ... QUANTIZE SCALAR [QUANTILE <0.0–1.0>] [ALWAYS RAM]
CREATE COLLECTION <name> ... QUANTIZE TURBO [BITS <1|1.5|2|4>] [ALWAYS RAM]
CREATE COLLECTION <name> ... QUANTIZE BINARY [ALWAYS RAM]
CREATE COLLECTION <name> ... QUANTIZE PRODUCT [ALWAYS RAM]
```

- **`QUANTILE <float>`** — (scalar only) calibration quantile for the INT8 conversion; defaults to Qdrant's built-in default (0.99) when omitted.
- **`ALWAYS RAM`** — keep the **quantized** vectors in RAM at all times, regardless of the collection's `on_disk` setting. Improves search throughput at the cost of higher RAM usage for the compressed index. The original full-precision vectors are stored and managed independently of this flag. Supported by all three quantization types.
- **`QUANTILE <float>`** — (SCALAR only) calibration quantile for the INT8 conversion; defaults to Qdrant's built-in default (0.99) when omitted.
- **`BITS <depth>`** — (TURBO only) bit depth passed to the Qdrant SDK:
- `4` — 4-bit (when `BITS` is omitted, QQL sends no bit depth and the Qdrant server applies its own default)
- `2` — 2-bit
- `1.5` — 1.5-bit
- `1` — 1-bit
> Compression ratios (8×, 16×, 24×, 32×) and recall characteristics are
> Qdrant server-side behaviors. QQL maps the `BITS` value to the SDK model and
> passes it to Qdrant; actual results depend on your Qdrant server version.
- **`ALWAYS RAM`** — keep the **quantized** vectors in RAM at all times, regardless of the collection's `on_disk` setting. Improves search throughput at the cost of higher RAM usage for the compressed index. The original full-precision vectors are stored and managed independently of this flag. Supported by all four quantization types.
- **`QUANTIZE`** always appears **after** all other clauses (`HYBRID`, `USING MODEL`, etc.).
- For `PRODUCT`, the compression ratio is fixed at **4×** in this version.
Collaborator

“For PRODUCT, the compression ratio is fixed at 4× in this version” is implementation-specific and matches the current executor, but for TURBO the docs are making stronger behavioral claims than the QQL layer actually enforces. QQL only maps the user input to the SDK model here; the runtime behavior still depends on Qdrant. I’d keep the docs precise about syntax/support and avoid overcommitting on engine behavior unless we intend to validate those guarantees end-to-end.

- For `TURBO`, Cosine, Dot, and Euclidean distance are supported by the Qdrant server when TurboQuant is enabled.
- When used with `HYBRID` collections, quantization applies only to the **dense** vector.
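
The `BITS` handling described above can be sketched in plain Python. This is a minimal illustration, not the QQL implementation: `BitSize` is a hypothetical stand-in for the SDK's `TurboQuantBitSize` enum (its string values here are made up), and `resolve_bits` is an illustrative helper name. It shows the two behaviours the docs rely on: an omitted `BITS` stays `None` so the server can apply its own default, and any value outside 1, 1.5, 2, 4 is rejected.

```python
from enum import Enum


class BitSize(Enum):
    """Hypothetical stand-in for the SDK's TurboQuantBitSize enum."""
    BITS4 = "bits4"
    BITS2 = "bits2"
    BITS1_5 = "bits1_5"
    BITS1 = "bits1"


# Accepted BITS values and the enum member each one maps to.
_BITS_MAP = {
    4.0: BitSize.BITS4,
    2.0: BitSize.BITS2,
    1.5: BitSize.BITS1_5,
    1.0: BitSize.BITS1,
}


def resolve_bits(bits):
    """Return the enum member for a BITS value, or None when BITS was omitted."""
    if bits is None:
        # Omission is preserved so the server can choose its own default.
        return None
    key = float(bits)
    if key not in _BITS_MAP:
        raise ValueError(
            f"Unsupported TURBO bit depth: {bits}. Valid values: 1, 1.5, 2, 4"
        )
    return _BITS_MAP[key]
```

For example, `resolve_bits(2)` yields `BitSize.BITS2`, `resolve_bits(None)` yields `None`, and `resolve_bits(3)` raises `ValueError`.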

**Examples:**
@@ -102,6 +113,26 @@ Scalar with explicit calibration and quantized vectors pinned to RAM:
CREATE COLLECTION research_papers QUANTIZE SCALAR QUANTILE 0.95 ALWAYS RAM
```

TurboQuant — default 4-bit (8× compression, good recall):
```sql
CREATE COLLECTION research_papers QUANTIZE TURBO
```

TurboQuant — 2-bit (16× compression):
```sql
CREATE COLLECTION research_papers QUANTIZE TURBO BITS 2
```

TurboQuant — 1.5-bit (24× compression) with quantized vectors pinned to RAM:
```sql
CREATE COLLECTION research_papers QUANTIZE TURBO BITS 1.5 ALWAYS RAM
```

TurboQuant — 1-bit (32× compression, same ratio as BINARY but better recall):
```sql
CREATE COLLECTION research_papers QUANTIZE TURBO BITS 1
```

Binary quantization for large high-dimensional embeddings:
```sql
CREATE COLLECTION research_papers QUANTIZE BINARY
@@ -115,22 +146,29 @@ CREATE COLLECTION research_papers QUANTIZE PRODUCT ALWAYS RAM
Combined with hybrid collection:
```sql
CREATE COLLECTION research_papers HYBRID QUANTIZE SCALAR
CREATE COLLECTION research_papers HYBRID QUANTIZE TURBO BITS 2
```

Combined with a pinned model:
```sql
CREATE COLLECTION research_papers USING MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE SCALAR QUANTILE 0.99
CREATE COLLECTION research_papers USING MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE TURBO BITS 2
```

Combined with hybrid + dense model:
```sql
CREATE COLLECTION research_papers USING HYBRID DENSE MODEL 'BAAI/bge-base-en-v1.5' QUANTIZE TURBO
```

**Valid combinations:**

| Base form | + QUANTIZE SCALAR | + QUANTIZE BINARY | + QUANTIZE PRODUCT |
|---|---|---|---|
| `CREATE COLLECTION name` | ✓ | ✓ | ✓ |
| `... HYBRID` | ✓ | ✓ | ✓ |
| `... USING MODEL 'x'` | ✓ | ✓ | ✓ |
| `... USING HYBRID` | ✓ | ✓ | ✓ |
| `... USING HYBRID DENSE MODEL 'x'` | ✓ | ✓ | ✓ |
| Base form | + SCALAR | + TURBO | + BINARY | + PRODUCT |
|---|---|---|---|---|
| `CREATE COLLECTION name` | ✓ | ✓ | ✓ | ✓ |
| `... HYBRID` | ✓ | ✓ | ✓ | ✓ |
| `... USING MODEL 'x'` | ✓ | ✓ | ✓ | ✓ |
| `... USING HYBRID` | ✓ | ✓ | ✓ | ✓ |
| `... USING HYBRID DENSE MODEL 'x'` | ✓ | ✓ | ✓ | ✓ |

> INSERT and SEARCH on quantized collections work exactly the same as on non-quantized ones — no changes to INSERT or SEARCH syntax are needed.
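
The clause ordering in the syntax and combinations tables can be sketched as a small string builder. This is a hypothetical helper (`build_create_collection` is not part of QQL); it only assembles statement text following the documented grammar: the base form first, `QUANTIZE` last, `BITS` only with `TURBO`, `QUANTILE` only with `SCALAR`, and `ALWAYS RAM` with any type.

```python
def build_create_collection(name, base="", quantize=None, *,
                            bits=None, quantile=None, always_ram=False):
    """Assemble a CREATE COLLECTION statement string (illustrative only)."""
    parts = ["CREATE COLLECTION", name]
    if base:
        # e.g. "HYBRID" or "USING MODEL 'BAAI/bge-base-en-v1.5'"
        parts.append(base)
    if quantize is None:
        if bits is not None or quantile is not None or always_ram:
            raise ValueError("BITS, QUANTILE and ALWAYS RAM require QUANTIZE")
        return " ".join(parts)

    kind = quantize.upper()
    if kind not in ("SCALAR", "TURBO", "BINARY", "PRODUCT"):
        raise ValueError(f"Unknown quantization type: {quantize}")
    # QUANTIZE always comes after every other clause.
    parts += ["QUANTIZE", kind]
    if quantile is not None:
        if kind != "SCALAR":
            raise ValueError("QUANTILE is only valid with SCALAR")
        parts += ["QUANTILE", f"{quantile:g}"]
    if bits is not None:
        if kind != "TURBO":
            raise ValueError("BITS is only valid with TURBO")
        if float(bits) not in (1.0, 1.5, 2.0, 4.0):
            raise ValueError("BITS must be one of 1, 1.5, 2, or 4")
        parts += ["BITS", f"{bits:g}"]
    if always_ram:
        parts += ["ALWAYS", "RAM"]
    return " ".join(parts)
```

Calling `build_create_collection("research_papers", "HYBRID", "turbo", bits=2)` produces the same text as the hybrid TURBO example shown earlier in this section.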

6 changes: 3 additions & 3 deletions pyproject.toml
Collaborator

Package metadata still advertises quantization support as scalar, binary, product. Since this PR adds TURBO, the published package description should be updated too so PyPI metadata stays aligned with the actual feature set.
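
One way to catch this kind of drift would be a small consistency check. The sketch below is an assumption, not something in this PR: `missing_quantization_names` is a hypothetical helper, and the name list simply mirrors the four `QuantizationType` values. It reports which quantization names a description string fails to mention.

```python
# Names mirror the QuantizationType enum values in src/qql/ast_nodes.py.
QUANTIZATION_NAMES = ("scalar", "turbo", "binary", "product")


def missing_quantization_names(text, names=QUANTIZATION_NAMES):
    """Return the quantization names that do not appear in `text`."""
    lowered = text.lower()
    return [name for name in names if name not in lowered]


old_description = "... quantization (scalar, binary, product), WHERE clause filters ..."
new_description = "... quantization (scalar, turbo, binary, product), WHERE clause filters ..."

print(missing_quantization_names(old_description))
print(missing_quantization_names(new_description))
```

The same function could be pointed at the README intro and the `pyproject.toml` description in a test, so that adding a new enum member without updating the summaries fails early.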

@@ -1,7 +1,7 @@
[project]
name = "qql-cli"
version = "2.0.0"
description = "QQL is a SQL-like query language and CLI for Qdrant vector database. Write INSERT, SEARCH, RECOMMEND, DELETE, and CREATE COLLECTION statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, binary, product), WHERE clause filters, script execution, and collection dump/restore."
version = "2.1.0"
description = "QQL is a SQL-like query language and CLI for Qdrant vector database. Write INSERT, SEARCH, RECOMMEND, DELETE, and CREATE COLLECTION statements instead of Python SDK calls. Supports hybrid dense+sparse vector search, cross-encoder reranking, quantization (scalar, turbo, binary, product), WHERE clause filters, script execution, and collection dump/restore."
readme = "README.md"
license = { file = "LICENSE" }
requires-python = ">=3.12"
@@ -37,7 +37,7 @@ classifiers = [
"Topic :: Text Processing :: Indexing",
]
dependencies = [
"qdrant-client[fastembed]>=1.13.0",
"qdrant-client[fastembed]>=1.18.0",
"click>=8.1.0",
"rich>=13.0.0",
"prompt_toolkit>=3.0.0",
6 changes: 4 additions & 2 deletions src/qql/ast_nodes.py
@@ -9,14 +9,16 @@ class QuantizationType(Enum):
SCALAR = "scalar"
BINARY = "binary"
PRODUCT = "product"
TURBO = "turbo"


@dataclass(frozen=True)
class QuantizationConfig:
"""Quantization settings parsed from a QUANTIZE clause."""
type: QuantizationType
quantile: float | None = None # SCALAR only; None → Qdrant default (0.99)
always_ram: bool = False # all types; default False
quantile: float | None = None # SCALAR only; None → Qdrant default (0.99)
always_ram: bool = False # all types; default False
turbo_bits: float | None = None # TURBO only; None → BITS omitted, server applies its own default


@dataclass(frozen=True)
27 changes: 26 additions & 1 deletion src/qql/executor.py
@@ -41,6 +41,9 @@
ScalarQuantization,
ScalarQuantizationConfig,
ScalarType,
TurboQuantBitSize,
TurboQuantization,
TurboQuantQuantizationConfig,
SearchParams,
SparseVector,
SparseVectorParams,
@@ -846,7 +849,7 @@ def _wrap_as_filter(self, qdrant_expr: Any) -> Filter:

def _build_quantization_config(
self, qc: QuantizationConfig
) -> ScalarQuantization | BinaryQuantization | ProductQuantization:
) -> ScalarQuantization | BinaryQuantization | ProductQuantization | TurboQuantization:
"""Convert a parsed QuantizationConfig to a Qdrant SDK quantization object."""
if qc.type == QuantizationType.SCALAR:
return ScalarQuantization(
@@ -867,6 +870,28 @@ def _build_quantization_config(
always_ram=qc.always_ram,
)
)
if qc.type == QuantizationType.TURBO:
_BITS_MAP: dict[float, TurboQuantBitSize] = {
4.0: TurboQuantBitSize.BITS4,
2.0: TurboQuantBitSize.BITS2,
1.5: TurboQuantBitSize.BITS1_5,
1.0: TurboQuantBitSize.BITS1,
}
if qc.turbo_bits is None:
bits_enum = None # user omitted BITS → preserve None, server applies default
elif qc.turbo_bits in _BITS_MAP:
bits_enum = _BITS_MAP[qc.turbo_bits]
else:
raise QQLRuntimeError(
f"Unsupported TURBO bit depth: {qc.turbo_bits}. "
f"Valid values: 1, 1.5, 2, 4"
)
return TurboQuantization(
turbo=TurboQuantQuantizationConfig(
bits=bits_enum,
always_ram=qc.always_ram,
)
)
raise QQLRuntimeError(f"Unknown quantization type: {qc.type}")

def _collection_is_hybrid(self, name: str) -> bool:
4 changes: 4 additions & 0 deletions src/qql/lexer.py
@@ -27,6 +27,8 @@ class TokenKind(Enum):
QUANTILE = auto()
ALWAYS = auto()
RAM = auto()
TURBO = auto()
BITS = auto()
CREATE = auto()
INDEX = auto()
ON = auto()
@@ -113,6 +115,8 @@ class TokenKind(Enum):
"QUANTILE": TokenKind.QUANTILE,
"ALWAYS": TokenKind.ALWAYS,
"RAM": TokenKind.RAM,
"TURBO": TokenKind.TURBO,
"BITS": TokenKind.BITS,
"CREATE": TokenKind.CREATE,
"INDEX": TokenKind.INDEX,
"ON": TokenKind.ON,
26 changes: 25 additions & 1 deletion src/qql/parser.py
@@ -248,8 +248,32 @@ def _parse_quantize_clause(self) -> QuantizationConfig:
always_ram = True
return QuantizationConfig(type=QuantizationType.PRODUCT, always_ram=always_ram)

if tok.kind == TokenKind.TURBO:
self._advance()
turbo_bits: float | None = None
always_ram = False
if self._peek().kind == TokenKind.BITS:
self._advance()
bits_tok = self._peek()
raw = float(self._parse_number())
if raw not in (1.0, 1.5, 2.0, 4.0):
raise QQLSyntaxError(
f"BITS must be one of 1, 1.5, 2, or 4 for TURBO quantization, got {raw}",
bits_tok.pos,
)
turbo_bits = raw
if self._peek().kind == TokenKind.ALWAYS:
self._advance()
self._expect(TokenKind.RAM)
always_ram = True
return QuantizationConfig(
type=QuantizationType.TURBO,
turbo_bits=turbo_bits,
always_ram=always_ram,
)

raise QQLSyntaxError(
f"Expected SCALAR, BINARY, or PRODUCT after QUANTIZE, got '{tok.value}'",
f"Expected SCALAR, BINARY, PRODUCT, or TURBO after QUANTIZE, got '{tok.value}'",
tok.pos,
)

107 changes: 107 additions & 0 deletions tests/test_executor.py
@@ -1640,3 +1640,110 @@ def test_result_message_no_quantization_suffix_when_absent(self, executor, mock_
node = CreateCollectionStmt(collection="articles")
result = executor.execute(node)
assert "quantization" not in result.message


class TestTurboQuantCreate:
"""Executor tests for QUANTIZE TURBO — verifies correct SDK objects are built."""

@pytest.fixture
def executor(self, cfg, mock_client):
return Executor(mock_client, cfg)

# ── TurboQuantization object is produced ──────────────────────────────

def test_turbo_passes_turbo_quantization(self, executor, mock_client):
from qdrant_client.models import TurboQuantization
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert isinstance(kw.get("quantization_config"), TurboQuantization)

def test_turbo_default_bits_is_none(self, executor, mock_client):
"""When BITS is omitted, bits must be None — preserving omission so the
SDK/server applies its own default rather than QQL forcing BITS4."""
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert kw["quantization_config"].turbo.bits is None

def test_turbo_bits2(self, executor, mock_client):
from qdrant_client.models import TurboQuantBitSize
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=2.0),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert kw["quantization_config"].turbo.bits == TurboQuantBitSize.BITS2

def test_turbo_bits1_5(self, executor, mock_client):
from qdrant_client.models import TurboQuantBitSize
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=1.5),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert kw["quantization_config"].turbo.bits == TurboQuantBitSize.BITS1_5

def test_turbo_bits1(self, executor, mock_client):
from qdrant_client.models import TurboQuantBitSize
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=1.0),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert kw["quantization_config"].turbo.bits == TurboQuantBitSize.BITS1

def test_turbo_always_ram_true(self, executor, mock_client):
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO, always_ram=True),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert kw["quantization_config"].turbo.always_ram is True

def test_turbo_always_ram_false_by_default(self, executor, mock_client):
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert kw["quantization_config"].turbo.always_ram is False

def test_turbo_hybrid_collection_has_both_configs(self, executor, mock_client):
from qdrant_client.models import TurboQuantization
node = CreateCollectionStmt(
collection="articles",
hybrid=True,
quantization=QuantizationConfig(type=QuantizationType.TURBO),
)
executor.execute(node)
kw = mock_client.create_collection.call_args.kwargs
assert isinstance(kw.get("quantization_config"), TurboQuantization)
assert "sparse_vectors_config" in kw

def test_turbo_result_message_includes_turbo(self, executor, mock_client):
node = CreateCollectionStmt(
collection="articles",
quantization=QuantizationConfig(type=QuantizationType.TURBO),
)
result = executor.execute(node)
assert "turbo" in result.message

def test_turbo_invalid_bits_at_executor_raises(self, executor, mock_client):
"""An unexpected turbo_bits value that bypasses parser validation must
raise QQLRuntimeError explicitly instead of silently coercing to BITS4."""
from qql.exceptions import QQLRuntimeError as QQLErr
qc = QuantizationConfig(type=QuantizationType.TURBO, turbo_bits=3.0)
with pytest.raises(QQLErr, match="Unsupported TURBO bit depth"):
executor._build_quantization_config(qc)